This R Markdown sheet includes a quite large set of R code chunks taken from the RHDS book R for Health Data Science by Ewen Harrison and Riinu Pius (CRC Press, 2021), available online at https://argoshare.is.ed.ac.uk/healthyr_book/. The structure follows the structure of the book (chapters 2, 3, 4, 5, 6, and 8 and their sections and subsections).

The goal is to get familiar with R and R Markdown (in a quite quick pace) using typical R functions for data wrangling, visualization, and analysis. You do not need to do very much (yet), mostly just read or browse the book while following and exploring the work flow in RStudio, activating the R functions and studying the output (results, statistics, graphics etc.) that will be created (including some tiny errors that are included on purpose). There are also some small exercises, where you are encouraged to write a bit of R code yourself (based on the hints given in the book). You may well skip those exercises.

The rest of the Exercise sets on this IODS (Introduction to Open Data Science) course have a different, more interactive structure. They have their own logic and detailed instructions, as you will see from the next week onwards.


Chapter 1: Why we love R

You should begin by reading (or browsing through) this chapter:

https://argoshare.is.ed.ac.uk/healthyr_book/why-we-love-r.html

Note1: On IODS course, we focus on writing the steps of the analysis in R Markdown, but we also use R scripts. Both ways of working are included in the RHDS book. Here, our focus is on the R Markdown (R scripts are a bit simpler).

Note2: We have already copy-pasted the R codes for you in this R Markdown sheet, beginning from Chapter 2. We have also included the example datasets of RHDS book through our GitHub repository, so all the datasets of the book are easily accessible on the fly, without a need to download them separately.

The following is the first example of an R code chunk, where you can activate R functions. Its contents have been copy-pasted from Section 1.7. Try now to activate the R “command” (“1001:1004”) by putting the cursor on that line below, and pressing Ctrl+Enter (on Windows) or Cmd+Enter (on Mac)!

# Now we're printing bigger numbers:
1001:1004
## [1] 1001 1002 1003 1004

As you see above, the results (in this case, the numbers 1001, 1002, 1003, and 1004) appear immediately below the R code chunk (preceded by the usual [1], which you will also get used to, very soon.)

OK, as soon as you have read Chapter 1, you are ready to move on, to Chapter 2!


Chapter 2: R basics

Again, the idea is that you read (or at least browse) the contents of the book while following the work flow (the R code chunks that have been created here for you).

So, begin reading at https://argoshare.is.ed.ac.uk/healthyr_book/r-basics.html.

As soon as you see R code in the book, look at your RStudio, in this R Markdown sheet. Activate the R functions one at a time and see what happens.

Have fun! This will get you ready for the rest of the IODS course!

2.1 Reading data into R

NOTE: The function library(…) is used to call another R package to work. When there is a library(…) function involved (as there is in the code chunk below), you must be prepared to install the package (if you have not installed that before). You can do that in the RStudio “Packages” pane on the right, but also by using an R function install.packages(“…”) in this editor. In some cases (like just below), the install.packages(“…”) function is given as a comment to the library call, which is a useful trick: you can select (with mouse or arrow keys) the install.packages(“…”) and run it (by Ctrl+Enter / Cmd+Enter). It is NOT a good idea to leave the install.packages(“…”) function in the code, outside of the comment (similarly as the library, for example), as then the package would be installed unnecessarily often (which takes time and other resources). Usually, installing is done once, and then you can use the package (with library(…)) as many times as you need to.

So try this below: select (with mouse or arrow keys) the install.packages(“…”) and run it (by Ctrl+Enter / Cmd+Enter). It will install the tidyverse package collection that we will need all the time.

After an successful installation (that might take some time), continue by activating the actual functions, beginning from library(…) and then finally reading the data in:

library(tidyverse) # install.packages("tidyverse")
## -- Attaching packages --------------------------------------- tidyverse 1.3.0 --
## v ggplot2 3.3.5     v purrr   0.3.4
## v tibble  3.1.0     v dplyr   1.0.5
## v tidyr   1.1.3     v stringr 1.4.0
## v readr   1.4.0     v forcats 0.5.1
## Warning: package 'ggplot2' was built under R version 4.0.5
## Warning: package 'tibble' was built under R version 4.0.4
## Warning: package 'tidyr' was built under R version 4.0.4
## Warning: package 'dplyr' was built under R version 4.0.4
## -- Conflicts ------------------------------------------ tidyverse_conflicts() --
## x dplyr::filter() masks stats::filter()
## x dplyr::lag()    masks stats::lag()
example_data <- read_csv("https://raw.githubusercontent.com/KimmoVehkalahti/RHDS/master/example_data.csv")
## 
## -- Column specification --------------------------------------------------------
## cols(
##   id = col_character(),
##   group = col_character(),
##   measurement = col_double(),
##   date = col_datetime(format = "")
## )
#View(example_data) # look at the data (and then close that view to return to the editor)

2.1.2 Reading in the Global Burden of Disease example dataset

#library(tidyverse)
gbd_short <- read_csv("https://raw.githubusercontent.com/KimmoVehkalahti/RHDS/master/data/global_burden_disease_cause-year.csv")
## 
## -- Column specification --------------------------------------------------------
## cols(
##   year = col_double(),
##   cause = col_character(),
##   deaths_millions = col_double()
## )
#View(gbd_short)

2.2 Variable types and why we care

Note: Some outputs may differ a bit from the book, because R is constantly developed further. One example is the following. (We have modified the code chunk a bit to reveal the outputs shown in the book.)

#library(tidyverse)
typesdata <- read_csv("https://raw.githubusercontent.com/KimmoVehkalahti/RHDS/master/data/typesdata.csv")
## 
## -- Column specification --------------------------------------------------------
## cols(
##   id = col_character(),
##   group = col_character(),
##   measurement = col_double(),
##   date = col_datetime(format = "")
## )
spec(typesdata)
## cols(
##   id = col_character(),
##   group = col_character(),
##   measurement = col_double(),
##   date = col_datetime(format = "")
## )
typesdata
## # A tibble: 3 x 4
##   id    group     measurement date               
##   <chr> <chr>           <dbl> <dttm>             
## 1 ID1   Control           1.8 2017-01-02 12:00:00
## 2 ID2   Treatment         4.5 2018-02-03 13:00:00
## 3 ID3   Treatment         3.7 2019-03-04 14:00:00
typesdata_faulty <- read_csv("https://raw.githubusercontent.com/KimmoVehkalahti/RHDS/master/data/typesdata_faulty.csv")
## 
## -- Column specification --------------------------------------------------------
## cols(
##   id = col_character(),
##   group = col_character(),
##   measurement = col_character(),
##   date = col_character()
## )
spec(typesdata_faulty)
## cols(
##   id = col_character(),
##   group = col_character(),
##   measurement = col_character(),
##   date = col_character()
## )
typesdata_faulty
## # A tibble: 3 x 4
##   id    group     measurement date           
##   <chr> <chr>     <chr>       <chr>          
## 1 ID1   Control   1.8         02-Jan-17 12:00
## 2 ID2   Treatment 4.5         03-Feb-18 13:00
## 3 ID3   Treatment 3.7 or 3.8  04-Mar-19 14:00

2.2.1 Numeric variables (continuous)

typesdata$measurement %>% mean()
## [1] 3.333333
measurement_mean <- typesdata$measurement %>% mean()

measurement_mean == 3.333333
## [1] FALSE
(0.10 + 0.05) == 0.15
## [1] FALSE
#library(tidyverse)
near(0.10+0.05, 0.15)
## [1] TRUE
near(measurement_mean, 3.333333, 0.000001)
## [1] TRUE

2.2.2 Character variables

#library(tidyverse)
typesdata %>%
  count(group)
## # A tibble: 2 x 2
##   group         n
##   <chr>     <int>
## 1 Control       1
## 2 Treatment     2
typesdata %>%
  count(group, sort = TRUE)
## # A tibble: 2 x 2
##   group         n
##   <chr>     <int>
## 1 Treatment     2
## 2 Control       1
# all ids are unique:
typesdata %>% 
  count(id, sort = TRUE)
## # A tibble: 3 x 2
##   id        n
##   <chr> <int>
## 1 ID1       1
## 2 ID2       1
## 3 ID3       1
# we add in a duplicate row where id = ID3,
# then count again:
typesdata %>% 
  add_row(id = "ID3") %>% 
  count(id, sort = TRUE)
## # A tibble: 3 x 2
##   id        n
##   <chr> <int>
## 1 ID3       2
## 2 ID1       1
## 3 ID2       1

2.2.3 Factor variables (categorical)

(just read - more about this topic will follow later…)

2.2.4 Date/time variables

Remember to install the lubridate package first…

library(lubridate) # install.packages("lubridate") # makes working with dates easier
## Warning: package 'lubridate' was built under R version 4.0.4
## 
## Attaching package: 'lubridate'
## The following objects are masked from 'package:base':
## 
##     date, intersect, setdiff, union
current_datetime <- Sys.time()
current_datetime
## [1] "2022-11-06 11:22:51 EET"
my_datetime <- "2020-12-01 12:00"
my_datetime
## [1] "2020-12-01 12:00"
# error that is explained in the book... (you may put '#' in front of it)
# my_datetime - current_datetime

current_datetime %>% class()
## [1] "POSIXct" "POSIXt"
my_datetime %>% class()
## [1] "character"

2.2.4 continued…

my_datetime_converted <- ymd_hm(my_datetime)
my_datetime_converted
## [1] "2020-12-01 12:00:00 UTC"
# now it will work:
my_datetime_converted - current_datetime
## Time difference of -704.8909 days
my_datesdiff <- my_datetime_converted - current_datetime
my_datesdiff %>% class()
## [1] "difftime"
ymd_hm("2021-01-02 12:00") + my_datesdiff
## [1] "2019-01-28 14:37:08 UTC"
# another error challenge here... (again you may put '#' in front of that afterwards)
# 560/my_datesdiff
560/as.numeric(my_datesdiff)
## [1] -0.7944492

2.2.4 continued…

parse_date_time("12:34 07/Jan'20", "%H:%M %d/%b'%y")
## [1] "2020-01-07 12:34:00 UTC"
Sys.time()
## [1] "2022-11-06 11:22:51 EET"
Sys.time() %>% format("%H:%M on %B-%d (%Y)")
## [1] "11:22 on November-06 (2022)"
Sys.time() %>% format("Happy days, the current time is %H:%M %B-%d (%Y)!")
## [1] "Happy days, the current time is 11:22 November-06 (2022)!"

2.3 Objects and functions

#library(tidyverse)
mydata <- tibble(
  id   = 1:4,
  sex  = c("Male", "Female", "Female", "Male"),
  var1 = c(4, 1, 2, 3),
  var2 = c(NA, 4, 5, NA),
  var3 = c(2, 1, NA, NA)
)

2.3.1 data frame/tibble

2.3.2 Naming objects

mydata
## # A tibble: 4 x 5
##      id sex     var1  var2  var3
##   <int> <chr>  <dbl> <dbl> <dbl>
## 1     1 Male       4    NA     2
## 2     2 Female     1     4     1
## 3     3 Female     2     5    NA
## 4     4 Male       3    NA    NA

2.3.3 Function and its arguments

mydata$var1
## [1] 4 1 2 3
mean(mydata$var1)
## [1] 2.5
mean(mydata$var2)
## [1] NA
mean(mydata$sex)
## Warning in mean.default(mydata$sex): argument is not numeric or logical:
## returning NA
## [1] NA
mean(mydata$var2, na.rm = TRUE)
## [1] 4.5
Sys.time()
## [1] "2022-11-06 11:22:51 EET"

2.3.4 Working with objects

a <- 103

a
## [1] 103
seq(15, 30) # I prefer c(15:30) more Matlab like
##  [1] 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30
example_sequence <- seq(15, 30)

example_sequence <- example_sequence/2

example_sequence
##  [1]  7.5  8.0  8.5  9.0  9.5 10.0 10.5 11.0 11.5 12.0 12.5 13.0 13.5 14.0 14.5
## [16] 15.0

2.3.5 <- and =

mean_result <- mean(mydata$var2, na.rm = TRUE)

2.3.6 Recap: object, function, input, argument

2.4 Pipe - %>%

#library(tidyverse)
mydata$var1 %>% mean()
## [1] 2.5
mean_result <- mydata$var1 %>% mean()

2.4.1 Using . to direct the pipe

mydata %>% 
  lm(var1~var2, data = .)
## 
## Call:
## lm(formula = var1 ~ var2, data = .)
## 
## Coefficients:
## (Intercept)         var2  
##          -3            1

2.5 Operators for filtering data

gbd_short %>% 
  filter(year < 1995)
## # A tibble: 3 x 3
##    year cause                     deaths_millions
##   <dbl> <chr>                               <dbl>
## 1  1990 Communicable diseases               15.4 
## 2  1990 Injuries                             4.25
## 3  1990 Non-communicable diseases           26.7
gbd_short %>% 
  filter(year <= 1995)
## # A tibble: 6 x 3
##    year cause                     deaths_millions
##   <dbl> <chr>                               <dbl>
## 1  1990 Communicable diseases               15.4 
## 2  1990 Injuries                             4.25
## 3  1990 Non-communicable diseases           26.7 
## 4  1995 Communicable diseases               15.1 
## 5  1995 Injuries                             4.53
## 6  1995 Non-communicable diseases           29.3
gbd_short %>% 
  filter(year == 1995)
## # A tibble: 3 x 3
##    year cause                     deaths_millions
##   <dbl> <chr>                               <dbl>
## 1  1995 Communicable diseases               15.1 
## 2  1995 Injuries                             4.53
## 3  1995 Non-communicable diseases           29.3

2.5 continued…

# do you see what is wrong here? (you may fix it or hide it with '#' afterwards)
gbd_short %>% 
 filter(year == 1995)
## # A tibble: 3 x 3
##    year cause                     deaths_millions
##   <dbl> <chr>                               <dbl>
## 1  1995 Communicable diseases               15.1 
## 2  1995 Injuries                             4.53
## 3  1995 Non-communicable diseases           29.3
gbd_short %>% 
  filter(year == 1995 | year == 2017)
## # A tibble: 6 x 3
##    year cause                     deaths_millions
##   <dbl> <chr>                               <dbl>
## 1  1995 Communicable diseases               15.1 
## 2  1995 Injuries                             4.53
## 3  1995 Non-communicable diseases           29.3 
## 4  2017 Communicable diseases               10.4 
## 5  2017 Injuries                             4.47
## 6  2017 Non-communicable diseases           40.9
gbd_short %>% 
  filter(year == max(year) | year == min(year))
## # A tibble: 6 x 3
##    year cause                     deaths_millions
##   <dbl> <chr>                               <dbl>
## 1  1990 Communicable diseases               15.4 
## 2  1990 Injuries                             4.25
## 3  1990 Non-communicable diseases           26.7 
## 4  2017 Communicable diseases               10.4 
## 5  2017 Injuries                             4.47
## 6  2017 Non-communicable diseases           40.9

2.5.1 Worked examples

mydata_year2000 <- gbd_short %>% 
  filter(year == 2000)

mydata_year2000
## # A tibble: 3 x 3
##    year cause                     deaths_millions
##   <dbl> <chr>                               <dbl>
## 1  2000 Communicable diseases               14.8 
## 2  2000 Injuries                             4.56
## 3  2000 Non-communicable diseases           31.0

2.5.1 continued…

new_data_selection <- gbd_short %>% 
  filter((year == 1990 | year == 2013) & cause == "Communicable diseases")

new_data_selection
## # A tibble: 1 x 3
##    year cause                 deaths_millions
##   <dbl> <chr>                           <dbl>
## 1  1990 Communicable diseases            15.4

2.5.1 continued…

# Or we can get rid of the extra brackets around the years
# by moving cause into a new filter on a new line:

new_data_selection <- gbd_short %>% 
  filter(year == 1990 | year == 2013) %>% 
  filter(cause == "Communicable diseases")

new_data_selection
## # A tibble: 1 x 3
##    year cause                 deaths_millions
##   <dbl> <chr>                           <dbl>
## 1  1990 Communicable diseases            15.4

2.5.1 continued…

# Or even better, we can include both in one filter() call, as all
# separate conditions are by default joined with "&":

new_data_selection <- gbd_short %>% 
  filter(year == 1990 | year == 2013,
         cause == "Communicable diseases")

new_data_selection
## # A tibble: 1 x 3
##    year cause                 deaths_millions
##   <dbl> <chr>                           <dbl>
## 1  1990 Communicable diseases            15.4

2.6 The combine function: c()

gbd_short$cause %>% unique()
## [1] "Communicable diseases"     "Injuries"                 
## [3] "Non-communicable diseases"
gbd_short %>% 
  # also filtering for a single year to keep the result concise
  filter(year == 1990) %>% 
  filter(cause == "Communicable diseases" | cause == "Non-communicable diseases")
## # A tibble: 2 x 3
##    year cause                     deaths_millions
##   <dbl> <chr>                               <dbl>
## 1  1990 Communicable diseases                15.4
## 2  1990 Non-communicable diseases            26.7
gbd_short %>% 
  filter(year == 1990) %>% 
  filter(cause %in% c("Communicable diseases", "Non-communicable diseases"))
## # A tibble: 2 x 3
##    year cause                     deaths_millions
##   <dbl> <chr>                               <dbl>
## 1  1990 Communicable diseases                15.4
## 2  1990 Non-communicable diseases            26.7

2.7 Missing values (NAs) and filters

mydata
## # A tibble: 4 x 5
##      id sex     var1  var2  var3
##   <int> <chr>  <dbl> <dbl> <dbl>
## 1     1 Male       4    NA     2
## 2     2 Female     1     4     1
## 3     3 Female     2     5    NA
## 4     4 Male       3    NA    NA
mydata %>% 
  filter(is.na(var2))
## # A tibble: 2 x 5
##      id sex    var1  var2  var3
##   <int> <chr> <dbl> <dbl> <dbl>
## 1     1 Male      4    NA     2
## 2     4 Male      3    NA    NA
mydata %>% 
  filter(!is.na(var2))
## # A tibble: 2 x 5
##      id sex     var1  var2  var3
##   <int> <chr>  <dbl> <dbl> <dbl>
## 1     2 Female     1     4     1
## 2     3 Female     2     5    NA
mydata %>% 
  filter(var2 != 5)
## # A tibble: 1 x 5
##      id sex     var1  var2  var3
##   <int> <chr>  <dbl> <dbl> <dbl>
## 1     2 Female     1     4     1
mydata %>% 
  filter(var2 != 5 | is.na(var2))
## # A tibble: 3 x 5
##      id sex     var1  var2  var3
##   <int> <chr>  <dbl> <dbl> <dbl>
## 1     1 Male       4    NA     2
## 2     2 Female     1     4     1
## 3     4 Male       3    NA    NA

2.7 continued…

subset1 <- mydata %>% 
  filter(var2 == 5)

subset2 <- mydata %>% 
  filter(! var2 == 5)

subset1; subset2
## # A tibble: 1 x 5
##      id sex     var1  var2  var3
##   <int> <chr>  <dbl> <dbl> <dbl>
## 1     3 Female     2     5    NA
## # A tibble: 1 x 5
##      id sex     var1  var2  var3
##   <int> <chr>  <dbl> <dbl> <dbl>
## 1     2 Female     1     4     1
nrow(mydata)
## [1] 4
nrow(subset1)
## [1] 1
nrow(subset2)
## [1] 1
nrow(subset1) + nrow(subset2) == nrow(mydata)
## [1] FALSE
subset3 <- mydata %>% 
  filter(is.na(var2))

nrow(subset1) + nrow(subset2) + nrow(subset3) == nrow(mydata)
## [1] TRUE

2.8 Creating new columns - mutate()

typesdata
## # A tibble: 3 x 4
##   id    group     measurement date               
##   <chr> <chr>           <dbl> <dttm>             
## 1 ID1   Control           1.8 2017-01-02 12:00:00
## 2 ID2   Treatment         4.5 2018-02-03 13:00:00
## 3 ID3   Treatment         3.7 2019-03-04 14:00:00
typesdata$measurement
## [1] 1.8 4.5 3.7
typesdata$measurement/2
## [1] 0.90 2.25 1.85
typesdata %>% 
  mutate(measurement/2)
## # A tibble: 3 x 5
##   id    group     measurement date                `measurement/2`
##   <chr> <chr>           <dbl> <dttm>                        <dbl>
## 1 ID1   Control           1.8 2017-01-02 12:00:00            0.9 
## 2 ID2   Treatment         4.5 2018-02-03 13:00:00            2.25
## 3 ID3   Treatment         3.7 2019-03-04 14:00:00            1.85
typesdata %>% 
  mutate(measurement_half = measurement/2)
## # A tibble: 3 x 5
##   id    group     measurement date                measurement_half
##   <chr> <chr>           <dbl> <dttm>                         <dbl>
## 1 ID1   Control           1.8 2017-01-02 12:00:00             0.9 
## 2 ID2   Treatment         4.5 2018-02-03 13:00:00             2.25
## 3 ID3   Treatment         3.7 2019-03-04 14:00:00             1.85

2.8 continued…

mydata$`Nasty column name`

# or

mydata %>% 
 select(`Nasty column name`)

2.8 continued…

typesdata_modified <- typesdata %>% 
  mutate(measurement_half = measurement/2)

typesdata_modified
## # A tibble: 3 x 5
##   id    group     measurement date                measurement_half
##   <chr> <chr>           <dbl> <dttm>                         <dbl>
## 1 ID1   Control           1.8 2017-01-02 12:00:00             0.9 
## 2 ID2   Treatment         4.5 2018-02-03 13:00:00             2.25
## 3 ID3   Treatment         3.7 2019-03-04 14:00:00             1.85

2.8 continued…

#library(lubridate)
typesdata %>% 
  mutate(reference_date   = ymd_hm("2020-01-01 12:00"),
         dates_difference = reference_date - date) %>% 
  select(date, reference_date, dates_difference)
## # A tibble: 3 x 3
##   date                reference_date      dates_difference
##   <dttm>              <dttm>              <drtn>          
## 1 2017-01-02 12:00:00 2020-01-01 12:00:00 1094.0000 days  
## 2 2018-02-03 13:00:00 2020-01-01 12:00:00  696.9583 days  
## 3 2019-03-04 14:00:00 2020-01-01 12:00:00  302.9167 days
typesdata %>% 
  mutate(mean_measurement = mean(measurement))
## # A tibble: 3 x 5
##   id    group     measurement date                mean_measurement
##   <chr> <chr>           <dbl> <dttm>                         <dbl>
## 1 ID1   Control           1.8 2017-01-02 12:00:00             3.33
## 2 ID2   Treatment         4.5 2018-02-03 13:00:00             3.33
## 3 ID3   Treatment         3.7 2019-03-04 14:00:00             3.33
typesdata %>% 
  mutate(mean_measurement     = mean(measurement)) %>% 
  mutate(measurement_relative = measurement/mean_measurement) %>% 
  select(matches("measurement"))
## # A tibble: 3 x 3
##   measurement mean_measurement measurement_relative
##         <dbl>            <dbl>                <dbl>
## 1         1.8             3.33                 0.54
## 2         4.5             3.33                 1.35
## 3         3.7             3.33                 1.11

2.8.1 Worked example/exercise

typesdata %>% 
  mutate(reference_date   = ymd_hm("2020-01-01 12:00"),
         dates_difference = reference_date - date) %>% 
  mutate(dates_difference = round(dates_difference)) %>% 
  select(matches("date"))
## # A tibble: 3 x 3
##   date                reference_date      dates_difference
##   <dttm>              <dttm>              <drtn>          
## 1 2017-01-02 12:00:00 2020-01-01 12:00:00 1094 days       
## 2 2018-02-03 13:00:00 2020-01-01 12:00:00  697 days       
## 3 2019-03-04 14:00:00 2020-01-01 12:00:00  303 days

2.9 Conditional calculations - if_else()

typesdata %>% 
  mutate(above_threshold = if_else(measurement > 3,
                                   "Above three",
                                   "Below three"))
## # A tibble: 3 x 5
##   id    group     measurement date                above_threshold
##   <chr> <chr>           <dbl> <dttm>              <chr>          
## 1 ID1   Control           1.8 2017-01-02 12:00:00 Below three    
## 2 ID2   Treatment         4.5 2018-02-03 13:00:00 Above three    
## 3 ID3   Treatment         3.7 2019-03-04 14:00:00 Above three

2.10 Create labels - paste()

typesdata %>% 
  mutate(plot_label = paste(id,
                            "was last measured at", date,
                            ", and the value was",    measurement)) %>% 
  select(plot_label)
## # A tibble: 3 x 1
##   plot_label                                                          
##   <chr>                                                               
## 1 ID1 was last measured at 2017-01-02 12:00:00 , and the value was 1.8
## 2 ID2 was last measured at 2018-02-03 13:00:00 , and the value was 4.5
## 3 ID3 was last measured at 2019-03-04 14:00:00 , and the value was 3.7
pastedata <- tibble(year  = c(2007, 2008, 2009),
                   month = c("Jan", "Feb", "March"),
                   day   = c(1, 2, 3))
pastedata
## # A tibble: 3 x 3
##    year month   day
##   <dbl> <chr> <dbl>
## 1  2007 Jan       1
## 2  2008 Feb       2
## 3  2009 March     3

2.10 continued…

pastedata %>% 
  mutate(date = paste(day, month, year, sep = "-"))
## # A tibble: 3 x 4
##    year month   day date        
##   <dbl> <chr> <dbl> <chr>       
## 1  2007 Jan       1 1-Jan-2007  
## 2  2008 Feb       2 2-Feb-2008  
## 3  2009 March     3 3-March-2009
#library(lubridate)

pastedata %>% 
  mutate(date = paste(day, month, year, sep = "-")) %>% 
  mutate(date = dmy(date))
## # A tibble: 3 x 4
##    year month   day date      
##   <dbl> <chr> <dbl> <date>    
## 1  2007 Jan       1 2007-01-01
## 2  2008 Feb       2 2008-02-02
## 3  2009 March     3 2009-03-03

2.11 Joining multiple datasets

#library(tidyverse)
patientdata <- read_csv("https://raw.githubusercontent.com/KimmoVehkalahti/RHDS/master/data/patient_data.csv")
## 
## -- Column specification --------------------------------------------------------
## cols(
##   id = col_double(),
##   sex = col_character(),
##   age = col_double()
## )
patientdata
## # A tibble: 6 x 3
##      id sex      age
##   <dbl> <chr>  <dbl>
## 1     1 Female    24
## 2     2 Male      59
## 3     3 Female    32
## 4     4 Female    84
## 5     5 Male      48
## 6     6 Female    65
labsdata <- read_csv("https://raw.githubusercontent.com/KimmoVehkalahti/RHDS/master/data/labs_data.csv")
## 
## -- Column specification --------------------------------------------------------
## cols(
##   id = col_double(),
##   measurement = col_double()
## )
labsdata
## # A tibble: 4 x 2
##      id measurement
##   <dbl>       <dbl>
## 1     5        3.47
## 2     6        7.31
## 3     8        9.91
## 4     7        6.11
 full_join(patientdata, labsdata)
## Joining, by = "id"
## # A tibble: 8 x 4
##      id sex      age measurement
##   <dbl> <chr>  <dbl>       <dbl>
## 1     1 Female    24       NA   
## 2     2 Male      59       NA   
## 3     3 Female    32       NA   
## 4     4 Female    84       NA   
## 5     5 Male      48        3.47
## 6     6 Female    65        7.31
## 7     8 <NA>      NA        9.91
## 8     7 <NA>      NA        6.11
inner_join(patientdata, labsdata)
## Joining, by = "id"
## # A tibble: 2 x 4
##      id sex      age measurement
##   <dbl> <chr>  <dbl>       <dbl>
## 1     5 Male      48        3.47
## 2     6 Female    65        7.31
 left_join(patientdata, labsdata)
## Joining, by = "id"
## # A tibble: 6 x 4
##      id sex      age measurement
##   <dbl> <chr>  <dbl>       <dbl>
## 1     1 Female    24       NA   
## 2     2 Male      59       NA   
## 3     3 Female    32       NA   
## 4     4 Female    84       NA   
## 5     5 Male      48        3.47
## 6     6 Female    65        7.31
right_join(patientdata, labsdata)
## Joining, by = "id"
## # A tibble: 4 x 4
##      id sex      age measurement
##   <dbl> <chr>  <dbl>       <dbl>
## 1     5 Male      48        3.47
## 2     6 Female    65        7.31
## 3     8 <NA>      NA        9.91
## 4     7 <NA>      NA        6.11

2.11.1 Further notes about joins

patientdata_new <- read_csv("https://raw.githubusercontent.com/KimmoVehkalahti/RHDS/master/data/patient_data_updated.csv")
## 
## -- Column specification --------------------------------------------------------
## cols(
##   id = col_double(),
##   sex = col_character(),
##   age = col_double()
## )
patientdata_new
## # A tibble: 2 x 3
##      id sex      age
##   <dbl> <chr>  <dbl>
## 1     7 Female    38
## 2     8 Male      29
bind_rows(patientdata, patientdata_new)
## # A tibble: 8 x 3
##      id sex      age
##   <dbl> <chr>  <dbl>
## 1     1 Female    24
## 2     2 Male      59
## 3     3 Female    32
## 4     4 Female    84
## 5     5 Male      48
## 6     6 Female    65
## 7     7 Female    38
## 8     8 Male      29
labsdata_updated <- labsdata %>% 
  add_row(id = 5, measurement = 2.49)
labsdata_updated
## # A tibble: 5 x 2
##      id measurement
##   <dbl>       <dbl>
## 1     5        3.47
## 2     6        7.31
## 3     8        9.91
## 4     7        6.11
## 5     5        2.49
left_join(patientdata, labsdata_updated)
## Joining, by = "id"
## # A tibble: 7 x 4
##      id sex      age measurement
##   <dbl> <chr>  <dbl>       <dbl>
## 1     1 Female    24       NA   
## 2     2 Male      59       NA   
## 3     3 Female    32       NA   
## 4     4 Female    84       NA   
## 5     5 Male      48        3.47
## 6     5 Male      48        2.49
## 7     6 Female    65        7.31

Well done! That was active reading of Chapter 2.

Working will continue below with Chapter 3…


Chapter 3: Summarising data

Continue reading at https://argoshare.is.ed.ac.uk/healthyr_book/summarising-data.html and working with the R code chunks.

3.1 Get the data

#library(tidyverse)
gbd_full <- read_csv("https://raw.githubusercontent.com/KimmoVehkalahti/RHDS/master/data/global_burden_disease_cause-year-sex-income.csv")
## 
## -- Column specification --------------------------------------------------------
## cols(
##   cause = col_character(),
##   year = col_double(),
##   sex = col_character(),
##   income = col_character(),
##   deaths_millions = col_double()
## )
# Creating a single-year tibble for printing and simple examples:
gbd2017 <- gbd_full %>% 
  filter(year == 2017)

#View(gbd2017)

3.2 Plot the data

gbd2017 %>% 
  # without the mutate(... = fct_relevel()) 
  # the panels get ordered alphabetically
  mutate(income = fct_relevel(income,
                              "Low",
                              "Lower-Middle",
                              "Upper-Middle",
                              "High")) %>% 
  # defining the variables using ggplot(aes(...)):
  ggplot(aes(x = sex, y = deaths_millions, fill = cause)) +
  # type of geom to be used: column (that's a type of barplot):
  geom_col(position = "dodge") +
  # facets for the income groups:
  facet_wrap(~income, ncol = 4) +
  # move the legend to the top of the plot (default is "right"):
  theme(legend.position = "top")

3.3 Aggregating: group_by(), summarise()

gbd2017$deaths_millions %>% sum()
## [1] 55.74
gbd2017 %>% 
  summarise(sum(deaths_millions))
## # A tibble: 1 x 1
##   `sum(deaths_millions)`
##                    <dbl>
## 1                   55.7
gbd2017 %>% 
  group_by(cause) %>% 
  summarise(sum(deaths_millions))
## # A tibble: 3 x 2
##   cause                     `sum(deaths_millions)`
##   <chr>                                      <dbl>
## 1 Communicable diseases                      10.4 
## 2 Injuries                                    4.47
## 3 Non-communicable diseases                  40.9
gbd2017 %>% 
  group_by(cause, sex) %>% 
  summarise(sum(deaths_millions))
## `summarise()` has grouped output by 'cause'. You can override using the
## `.groups` argument.
## # A tibble: 6 x 3
## # Groups:   cause [3]
##   cause                     sex    `sum(deaths_millions)`
##   <chr>                     <chr>                   <dbl>
## 1 Communicable diseases     Female                   4.91
## 2 Communicable diseases     Male                     5.47
## 3 Injuries                  Female                   1.42
## 4 Injuries                  Male                     3.05
## 5 Non-communicable diseases Female                  19.2 
## 6 Non-communicable diseases Male                    21.7

3.4 Add new columns: mutate()

gbd2017 %>% 
  group_by(cause, sex) %>% 
  summarise(deaths_per_group = sum(deaths_millions)) %>% 
  ungroup() %>% 
  mutate(deaths_total = sum(deaths_per_group))
## `summarise()` has grouped output by 'cause'. You can override using the
## `.groups` argument.
## # A tibble: 6 x 4
##   cause                     sex    deaths_per_group deaths_total
##   <chr>                     <chr>             <dbl>        <dbl>
## 1 Communicable diseases     Female             4.91         55.7
## 2 Communicable diseases     Male               5.47         55.7
## 3 Injuries                  Female             1.42         55.7
## 4 Injuries                  Male               3.05         55.7
## 5 Non-communicable diseases Female            19.2          55.7
## 6 Non-communicable diseases Male              21.7          55.7

3.4.1 Percentages formatting: percent()

# percent() function for formatting percentages come from library(scales)
library(scales)
## Warning: package 'scales' was built under R version 4.0.5
## 
## Attaching package: 'scales'
## The following object is masked from 'package:purrr':
## 
##     discard
## The following object is masked from 'package:readr':
## 
##     col_factor
gbd2017_summarised <- gbd2017 %>% 
  group_by(cause, sex) %>% 
  summarise(deaths_per_group = sum(deaths_millions)) %>% 
  ungroup() %>% 
  mutate(deaths_total    = sum(deaths_per_group),
         deaths_relative = percent(deaths_per_group/deaths_total))
## `summarise()` has grouped output by 'cause'. You can override using the
## `.groups` argument.
gbd2017_summarised
## # A tibble: 6 x 5
##   cause                     sex    deaths_per_group deaths_total deaths_relative
##   <chr>                     <chr>             <dbl>        <dbl> <chr>          
## 1 Communicable diseases     Female             4.91         55.7 8.8%           
## 2 Communicable diseases     Male               5.47         55.7 9.8%           
## 3 Injuries                  Female             1.42         55.7 2.5%           
## 4 Injuries                  Male               3.05         55.7 5.5%           
## 5 Non-communicable diseases Female            19.2          55.7 34.4%          
## 6 Non-communicable diseases Male              21.7          55.7 39.0%
# using values from the first row as an example:
round(100*4.91/55.74, 1) %>% paste0("%")
## [1] "8.8%"
gbd2017_summarised %>% 
  mutate(deaths_relative = deaths_per_group/deaths_total)
## # A tibble: 6 x 5
##   cause                     sex    deaths_per_group deaths_total deaths_relative
##   <chr>                     <chr>             <dbl>        <dbl>           <dbl>
## 1 Communicable diseases     Female             4.91         55.7          0.0881
## 2 Communicable diseases     Male               5.47         55.7          0.0981
## 3 Injuries                  Female             1.42         55.7          0.0255
## 4 Injuries                  Male               3.05         55.7          0.0547
## 5 Non-communicable diseases Female            19.2          55.7          0.344 
## 6 Non-communicable diseases Male              21.7          55.7          0.390

3.5 summarise() vs mutate()

gbd_summarised <- gbd2017 %>% 
  group_by(cause, sex) %>% 
  summarise(deaths_per_group = sum(deaths_millions)) %>% 
  arrange(sex)
## `summarise()` has grouped output by 'cause'. You can override using the
## `.groups` argument.
gbd_summarised
## # A tibble: 6 x 3
## # Groups:   cause [3]
##   cause                     sex    deaths_per_group
##   <chr>                     <chr>             <dbl>
## 1 Communicable diseases     Female             4.91
## 2 Injuries                  Female             1.42
## 3 Non-communicable diseases Female            19.2 
## 4 Communicable diseases     Male               5.47
## 5 Injuries                  Male               3.05
## 6 Non-communicable diseases Male              21.7
gbd_summarised_sex <- gbd_summarised %>% 
  group_by(sex) %>% 
  summarise(deaths_per_sex = sum(deaths_per_group))

gbd_summarised_sex
## # A tibble: 2 x 2
##   sex    deaths_per_sex
##   <chr>           <dbl>
## 1 Female           25.5
## 2 Male             30.3

3.5 continued…

full_join(gbd_summarised, gbd_summarised_sex)
## Joining, by = "sex"
## # A tibble: 6 x 4
## # Groups:   cause [3]
##   cause                     sex    deaths_per_group deaths_per_sex
##   <chr>                     <chr>             <dbl>          <dbl>
## 1 Communicable diseases     Female             4.91           25.5
## 2 Injuries                  Female             1.42           25.5
## 3 Non-communicable diseases Female            19.2            25.5
## 4 Communicable diseases     Male               5.47           30.3
## 5 Injuries                  Male               3.05           30.3
## 6 Non-communicable diseases Male              21.7            30.3
gbd_summarised %>% 
  group_by(sex) %>% 
  mutate(deaths_per_sex = sum(deaths_per_group))
## # A tibble: 6 x 4
## # Groups:   sex [2]
##   cause                     sex    deaths_per_group deaths_per_sex
##   <chr>                     <chr>             <dbl>          <dbl>
## 1 Communicable diseases     Female             4.91           25.5
## 2 Injuries                  Female             1.42           25.5
## 3 Non-communicable diseases Female            19.2            25.5
## 4 Communicable diseases     Male               5.47           30.3
## 5 Injuries                  Male               3.05           30.3
## 6 Non-communicable diseases Male              21.7            30.3
gbd2017 %>% 
  group_by(cause, sex) %>% 
  summarise(deaths_per_group = sum(deaths_millions)) %>% 
  group_by(sex) %>% 
  mutate(deaths_per_sex  = sum(deaths_per_group),
         sex_cause_perc = percent(deaths_per_group/deaths_per_sex)) %>% 
  arrange(sex, deaths_per_group)
## `summarise()` has grouped output by 'cause'. You can override using the
## `.groups` argument.
## # A tibble: 6 x 5
## # Groups:   sex [2]
##   cause                     sex   deaths_per_group deaths_per_sex sex_cause_perc
##   <chr>                     <chr>            <dbl>          <dbl> <chr>         
## 1 Injuries                  Fema~             1.42           25.5 6%            
## 2 Communicable diseases     Fema~             4.91           25.5 19%           
## 3 Non-communicable diseases Fema~            19.2            25.5 75%           
## 4 Injuries                  Male              3.05           30.3 10.1%         
## 5 Communicable diseases     Male              5.47           30.3 18.1%         
## 6 Non-communicable diseases Male             21.7            30.3 71.8%

3.6 Common arithmetic functions - sum(), mean(), median(), etc.

mynumbers <- c(1, 2, NA)
sum(mynumbers)
## [1] NA
sum(mynumbers, na.rm = TRUE)
## [1] 3

3.7 select() columns

gbd_2rows <- gbd_full %>% 
  slice(1:2)

gbd_2rows
## # A tibble: 2 x 5
##   cause                  year sex    income       deaths_millions
##   <chr>                 <dbl> <chr>  <chr>                  <dbl>
## 1 Communicable diseases  1990 Female High                    0.21
## 2 Communicable diseases  1990 Female Upper-Middle            1.15
gbd_2rows %>% 
  select(cause, deaths_millions)
## # A tibble: 2 x 2
##   cause                 deaths_millions
##   <chr>                           <dbl>
## 1 Communicable diseases            0.21
## 2 Communicable diseases            1.15
gbd_2rows %>% 
  select(cause, deaths = deaths_millions)
## # A tibble: 2 x 2
##   cause                 deaths
##   <chr>                  <dbl>
## 1 Communicable diseases   0.21
## 2 Communicable diseases   1.15
gbd_2rows %>% 
  rename(deaths = deaths_millions)
## # A tibble: 2 x 5
##   cause                  year sex    income       deaths
##   <chr>                 <dbl> <chr>  <chr>         <dbl>
## 1 Communicable diseases  1990 Female High           0.21
## 2 Communicable diseases  1990 Female Upper-Middle   1.15
gbd_2rows %>% 
  select(year, sex, income, cause, deaths_millions)
## # A tibble: 2 x 5
##    year sex    income       cause                 deaths_millions
##   <dbl> <chr>  <chr>        <chr>                           <dbl>
## 1  1990 Female High         Communicable diseases            0.21
## 2  1990 Female Upper-Middle Communicable diseases            1.15
gbd_2rows %>% 
  select(year, sex, everything())
## # A tibble: 2 x 5
##    year sex    cause                 income       deaths_millions
##   <dbl> <chr>  <chr>                 <chr>                  <dbl>
## 1  1990 Female Communicable diseases High                    0.21
## 2  1990 Female Communicable diseases Upper-Middle            1.15
gbd_2rows %>% 
  select(starts_with("deaths"))
## # A tibble: 2 x 1
##   deaths_millions
##             <dbl>
## 1            0.21
## 2            1.15

3.8 Reshaping data - long vs wide format

gbd_wide <- read_csv("https://raw.githubusercontent.com/KimmoVehkalahti/RHDS/master/data/global_burden_disease_wide-format.csv")
## 
## -- Column specification --------------------------------------------------------
## cols(
##   cause = col_character(),
##   Female_1990 = col_double(),
##   Female_2017 = col_double(),
##   Male_1990 = col_double(),
##   Male_2017 = col_double()
## )
gbd_long <- read_csv("https://raw.githubusercontent.com/KimmoVehkalahti/RHDS/master/data/global_burden_disease_cause-year-sex.csv")
## 
## -- Column specification --------------------------------------------------------
## cols(
##   cause = col_character(),
##   year = col_double(),
##   sex = col_character(),
##   deaths_millions = col_double()
## )
gbd_wide
## # A tibble: 3 x 5
##   cause                     Female_1990 Female_2017 Male_1990 Male_2017
##   <chr>                           <dbl>       <dbl>     <dbl>     <dbl>
## 1 Communicable diseases            7.3         4.91      8.06      5.47
## 2 Injuries                         1.41        1.42      2.84      3.05
## 3 Non-communicable diseases       12.8        19.2      13.9      21.7
gbd_long
## # A tibble: 12 x 4
##    cause                      year sex    deaths_millions
##    <chr>                     <dbl> <chr>            <dbl>
##  1 Communicable diseases      1990 Female            7.3 
##  2 Communicable diseases      2017 Female            4.91
##  3 Communicable diseases      1990 Male              8.06
##  4 Communicable diseases      2017 Male              5.47
##  5 Injuries                   1990 Female            1.41
##  6 Injuries                   2017 Female            1.42
##  7 Injuries                   1990 Male              2.84
##  8 Injuries                   2017 Male              3.05
##  9 Non-communicable diseases  1990 Female           12.8 
## 10 Non-communicable diseases  2017 Female           19.2 
## 11 Non-communicable diseases  1990 Male             13.9 
## 12 Non-communicable diseases  2017 Male             21.7

3.8.1 Pivot values from rows into columns (wider)

gbd_long %>% 
  pivot_wider(names_from = year, values_from = deaths_millions)
## # A tibble: 6 x 4
##   cause                     sex    `1990` `2017`
##   <chr>                     <chr>   <dbl>  <dbl>
## 1 Communicable diseases     Female   7.3    4.91
## 2 Communicable diseases     Male     8.06   5.47
## 3 Injuries                  Female   1.41   1.42
## 4 Injuries                  Male     2.84   3.05
## 5 Non-communicable diseases Female  12.8   19.2 
## 6 Non-communicable diseases Male    13.9   21.7
gbd_long %>% 
  pivot_wider(names_from = sex, values_from = deaths_millions) %>% 
  mutate(Male - Female)
## # A tibble: 6 x 5
##   cause                      year Female  Male `Male - Female`
##   <chr>                     <dbl>  <dbl> <dbl>           <dbl>
## 1 Communicable diseases      1990   7.3   8.06           0.760
## 2 Communicable diseases      2017   4.91  5.47           0.560
## 3 Injuries                   1990   1.41  2.84           1.43 
## 4 Injuries                   2017   1.42  3.05           1.63 
## 5 Non-communicable diseases  1990  12.8  13.9            1.11 
## 6 Non-communicable diseases  2017  19.2  21.7            2.59
gbd_long %>% 
  pivot_wider(names_from = c(sex, year), values_from = deaths_millions)
## # A tibble: 3 x 5
##   cause                     Female_1990 Female_2017 Male_1990 Male_2017
##   <chr>                           <dbl>       <dbl>     <dbl>     <dbl>
## 1 Communicable diseases            7.3         4.91      8.06      5.47
## 2 Injuries                         1.41        1.42      2.84      3.05
## 3 Non-communicable diseases       12.8        19.2      13.9      21.7

3.8.2 Pivot values from columns to rows (longer)

gbd_wide %>% 
  pivot_longer(matches("Female|Male"), 
               names_to = "sex_year", 
               values_to = "deaths_millions") %>% 
  slice(1:6)
## # A tibble: 6 x 3
##   cause                 sex_year    deaths_millions
##   <chr>                 <chr>                 <dbl>
## 1 Communicable diseases Female_1990            7.3 
## 2 Communicable diseases Female_2017            4.91
## 3 Communicable diseases Male_1990              8.06
## 4 Communicable diseases Male_2017              5.47
## 5 Injuries              Female_1990            1.41
## 6 Injuries              Female_2017            1.42
gbd_wide %>% 
  select(matches("Female|Male"))
## # A tibble: 3 x 4
##   Female_1990 Female_2017 Male_1990 Male_2017
##         <dbl>       <dbl>     <dbl>     <dbl>
## 1        7.3         4.91      8.06      5.47
## 2        1.41        1.42      2.84      3.05
## 3       12.8        19.2      13.9      21.7

3.8.3 separate() a column into multiple columns

gbd_wide %>% 
  # same pivot_longer as before
  pivot_longer(matches("Female|Male"), 
               names_to = "sex_year", 
               values_to = "deaths_millions") %>% 
  separate(sex_year, into = c("sex", "year"), sep = "_", convert = TRUE)
## # A tibble: 12 x 4
##    cause                     sex     year deaths_millions
##    <chr>                     <chr>  <int>           <dbl>
##  1 Communicable diseases     Female  1990            7.3 
##  2 Communicable diseases     Female  2017            4.91
##  3 Communicable diseases     Male    1990            8.06
##  4 Communicable diseases     Male    2017            5.47
##  5 Injuries                  Female  1990            1.41
##  6 Injuries                  Female  2017            1.42
##  7 Injuries                  Male    1990            2.84
##  8 Injuries                  Male    2017            3.05
##  9 Non-communicable diseases Female  1990           12.8 
## 10 Non-communicable diseases Female  2017           19.2 
## 11 Non-communicable diseases Male    1990           13.9 
## 12 Non-communicable diseases Male    2017           21.7

3.9 arrange() rows

gbd_long %>% 
  arrange(deaths_millions) %>% 
  # first 3 rows just for printing:
  slice(1:3)
## # A tibble: 3 x 4
##   cause     year sex    deaths_millions
##   <chr>    <dbl> <chr>            <dbl>
## 1 Injuries  1990 Female            1.41
## 2 Injuries  2017 Female            1.42
## 3 Injuries  1990 Male              2.84
gbd_long %>% 
  arrange(-deaths_millions) %>% 
  slice(1:3)
## # A tibble: 3 x 4
##   cause                      year sex    deaths_millions
##   <chr>                     <dbl> <chr>            <dbl>
## 1 Non-communicable diseases  2017 Male              21.7
## 2 Non-communicable diseases  2017 Female            19.2
## 3 Non-communicable diseases  1990 Male              13.9
gbd_long %>% 
  arrange(desc(sex)) %>% 
  # printing rows 1, 2, 11, and 12
  slice(1,2, 11, 12)
## # A tibble: 4 x 4
##   cause                      year sex    deaths_millions
##   <chr>                     <dbl> <chr>            <dbl>
## 1 Communicable diseases      1990 Male              8.06
## 2 Communicable diseases      2017 Male              5.47
## 3 Non-communicable diseases  1990 Female           12.8 
## 4 Non-communicable diseases  2017 Female           19.2

3.9.1 Factor levels

gbd_factored <- gbd_long %>% 
  mutate(cause = factor(cause))

gbd_factored$cause %>% levels()
## [1] "Communicable diseases"     "Injuries"                 
## [3] "Non-communicable diseases"
gbd_factored <- gbd_factored %>% 
  mutate(cause = cause %>% 
           fct_relevel("Injuries"))

gbd_factored$cause %>% levels()
## [1] "Injuries"                  "Communicable diseases"    
## [3] "Non-communicable diseases"

3.10.1 Exercise - pivot_wider()

Look at Table 3.4 on page https://argoshare.is.ed.ac.uk/healthyr_book/exercises.html

gbd_long <- read_csv("https://raw.githubusercontent.com/KimmoVehkalahti/RHDS/master/data/global_burden_disease_cause-year-sex.csv")
## 
## -- Column specification --------------------------------------------------------
## cols(
##   cause = col_character(),
##   year = col_double(),
##   sex = col_character(),
##   deaths_millions = col_double()
## )
# Solution:

gbd_long %>% 
  pivot_wider(names_from = cause, values_from = deaths_millions)
## # A tibble: 4 x 5
##    year sex    `Communicable diseases` Injuries `Non-communicable diseases`
##   <dbl> <chr>                    <dbl>    <dbl>                       <dbl>
## 1  1990 Female                    7.3      1.41                        12.8
## 2  2017 Female                    4.91     1.42                        19.2
## 3  1990 Male                      8.06     2.84                        13.9
## 4  2017 Male                      5.47     3.05                        21.7

3.10.2 Exercise - group_by(), summarise()

gbd_full <- read_csv("https://raw.githubusercontent.com/KimmoVehkalahti/RHDS/master/data/global_burden_disease_cause-year-sex-income.csv")
## 
## -- Column specification --------------------------------------------------------
## cols(
##   cause = col_character(),
##   year = col_double(),
##   sex = col_character(),
##   income = col_character(),
##   deaths_millions = col_double()
## )
glimpse(gbd_full)
## Rows: 168
## Columns: 5
## $ cause           <chr> "Communicable diseases", "Communicable diseases", "Com~
## $ year            <dbl> 1990, 1990, 1990, 1990, 1990, 1990, 1990, 1990, 1990, ~
## $ sex             <chr> "Female", "Female", "Female", "Female", "Male", "Male"~
## $ income          <chr> "High", "Upper-Middle", "Lower-Middle", "Low", "High",~
## $ deaths_millions <dbl> 0.21, 1.15, 4.43, 1.51, 0.26, 1.35, 4.73, 1.72, 0.20, ~
summary_data1 <- 
  gbd_full %>% 
  group_by(year) %>% 
  summarise(total_per_year = sum(deaths_millions))

summary_data1
## # A tibble: 7 x 2
##    year total_per_year
##   <dbl>          <dbl>
## 1  1990           46.3
## 2  1995           48.9
## 3  2000           50.4
## 4  2005           51.2
## 5  2010           52.6
## 6  2015           54.6
## 7  2017           55.7
summary_data2 <- 
  gbd_full %>% 
  group_by(year, cause) %>% 
  summarise(total_per_cause = sum(deaths_millions))
## `summarise()` has grouped output by 'year'. You can override using the `.groups`
## argument.
summary_data2
## # A tibble: 21 x 3
## # Groups:   year [7]
##     year cause                     total_per_cause
##    <dbl> <chr>                               <dbl>
##  1  1990 Communicable diseases               15.4 
##  2  1990 Injuries                             4.25
##  3  1990 Non-communicable diseases           26.7 
##  4  1995 Communicable diseases               15.1 
##  5  1995 Injuries                             4.53
##  6  1995 Non-communicable diseases           29.3 
##  7  2000 Communicable diseases               14.8 
##  8  2000 Injuries                             4.56
##  9  2000 Non-communicable diseases           31.0 
## 10  2005 Communicable diseases               13.9 
## # ... with 11 more rows

3.10.3 Exercise - full_join(), percent()

# Solution:

library(scales)
full_join(summary_data1, summary_data2) %>% 
  mutate(percentage = percent(total_per_cause/total_per_year)) 
## Joining, by = "year"
## # A tibble: 21 x 5
##     year total_per_year cause                     total_per_cause percentage
##    <dbl>          <dbl> <chr>                               <dbl> <chr>     
##  1  1990           46.3 Communicable diseases               15.4  33.161%   
##  2  1990           46.3 Injuries                             4.25 9.175%    
##  3  1990           46.3 Non-communicable diseases           26.7  57.664%   
##  4  1995           48.9 Communicable diseases               15.1  30.893%   
##  5  1995           48.9 Injuries                             4.53 9.262%    
##  6  1995           48.9 Non-communicable diseases           29.3  59.845%   
##  7  2000           50.4 Communicable diseases               14.8  29.397%   
##  8  2000           50.4 Injuries                             4.56 9.051%    
##  9  2000           50.4 Non-communicable diseases           31.0  61.552%   
## 10  2005           51.2 Communicable diseases               13.9  27.102%   
## # ... with 11 more rows

3.10.4 Exercise - mutate(), summarise()

# Solution:

gbd_full %>% 
  # aggregate to deaths per cause per year using summarise()
  group_by(year, cause) %>% 
  summarise(total_per_cause = sum(deaths_millions)) %>% 
  # then add a column of yearly totals using mutate()
  group_by(year) %>% 
  mutate(total_per_year = sum(total_per_cause)) %>% 
  # add the percentage column
  mutate(percentage = percent(total_per_cause/total_per_year)) %>% 
  # select the final variables for better viewing
  select(year, cause, percentage) %>% 
  pivot_wider(names_from = cause, values_from = percentage)
## `summarise()` has grouped output by 'year'. You can override using the `.groups`
## argument.
## # A tibble: 7 x 4
## # Groups:   year [7]
##    year `Communicable diseases` Injuries `Non-communicable diseases`
##   <dbl> <chr>                   <chr>    <chr>                      
## 1  1990 33%                     9%       58%                        
## 2  1995 31%                     9%       60%                        
## 3  2000 29%                     9%       62%                        
## 4  2005 27%                     9%       64%                        
## 5  2010 24%                     9%       67%                        
## 6  2015 20%                     8%       72%                        
## 7  2017 19%                     8%       73%

3.10.5 Exercise - filter(), summarise(), pivot_wider()

# Solution:

gbd_full %>% 
  filter(year == 1990) %>% 
  group_by(income, sex) %>% 
  summarise(total_deaths = sum(deaths_millions)) %>% 
  pivot_wider(names_from = income, values_from = total_deaths)
## `summarise()` has grouped output by 'income'. You can override using the
## `.groups` argument.
## # A tibble: 2 x 5
##   sex     High   Low `Lower-Middle` `Upper-Middle`
##   <chr>  <dbl> <dbl>          <dbl>          <dbl>
## 1 Female  4.14  2.22           8.47           6.68
## 2 Male    4.46  2.57           9.83           7.95

Wow! That was all for Chapter 3. GOOD JOB.

The next chapter will take us in plotting awesome graphs. Let’s proceed!


Chapter 4: Different types of plots

Look at Figure 4.1 at https://argoshare.is.ed.ac.uk/healthyr_book/chap04-h1.html.

You will now re-create the figure step by step!

4.1 Get the data

#library(tidyverse)
library(gapminder) # install.packages("gapminder")
## Warning: package 'gapminder' was built under R version 4.0.5
glimpse(gapminder)
## Rows: 1,704
## Columns: 6
## $ country   <fct> "Afghanistan", "Afghanistan", "Afghanistan", "Afghanistan", ~
## $ continent <fct> Asia, Asia, Asia, Asia, Asia, Asia, Asia, Asia, Asia, Asia, ~
## $ year      <int> 1952, 1957, 1962, 1967, 1972, 1977, 1982, 1987, 1992, 1997, ~
## $ lifeExp   <dbl> 28.801, 30.332, 31.997, 34.020, 36.088, 38.438, 39.854, 40.8~
## $ pop       <int> 8425333, 9240934, 10267083, 11537966, 13079460, 14880372, 12~
## $ gdpPercap <dbl> 779.4453, 820.8530, 853.1007, 836.1971, 739.9811, 786.1134, ~
gapminder$year %>% unique()
##  [1] 1952 1957 1962 1967 1972 1977 1982 1987 1992 1997 2002 2007
gapminder$country %>% n_distinct()
## [1] 142
gapminder$continent %>% unique()
## [1] Asia     Europe   Africa   Americas Oceania 
## Levels: Africa Americas Asia Europe Oceania
gapdata2007 <- gapminder %>% 
  filter(year == 2007)

gapdata2007
## # A tibble: 142 x 6
##    country     continent  year lifeExp       pop gdpPercap
##    <fct>       <fct>     <int>   <dbl>     <int>     <dbl>
##  1 Afghanistan Asia       2007    43.8  31889923      975.
##  2 Albania     Europe     2007    76.4   3600523     5937.
##  3 Algeria     Africa     2007    72.3  33333216     6223.
##  4 Angola      Africa     2007    42.7  12420476     4797.
##  5 Argentina   Americas   2007    75.3  40301927    12779.
##  6 Australia   Oceania    2007    81.2  20434176    34435.
##  7 Austria     Europe     2007    79.8   8199783    36126.
##  8 Bahrain     Asia       2007    75.6    708573    29796.
##  9 Bangladesh  Asia       2007    64.1 150448339     1391.
## 10 Belgium     Europe     2007    79.4  10392226    33693.
## # ... with 132 more rows
# loads the gapminder dataset from the package environment
# into your Global Environment
gapdata <- gapminder

4.2 Anatomy of ggplot explained

# recommended form:
library(plotly)
## Warning: package 'plotly' was built under R version 4.0.5
## 
## Attaching package: 'plotly'
## The following object is masked from 'package:ggplot2':
## 
##     last_plot
## The following object is masked from 'package:stats':
## 
##     filter
## The following object is masked from 'package:graphics':
## 
##     layout
plot = gapdata2007 %>% 
  ggplot(aes(x = gdpPercap, y = lifeExp, color = continent, label = country)) +
  geom_point()

ggplotly(plot)
# NOT recommended form:
ggplot(gapdata2007, aes(x = gdpPercap, y = lifeExp))

# just a schematic example of using the pipe:
#
# data %>% 
#   filter(...) %>% 
#   mutate(...) %>% 
#   ggplot(aes(...)) +
#   ...

4.2 continued…

gapdata2007 %>% 
  ggplot(aes(x = gdpPercap, y = lifeExp)) +
  geom_point()

gapdata2007 %>% 
  ggplot(aes(x = continent, y = lifeExp)) +
  geom_point()

gapdata2007 %>% 
  ggplot(aes(x = gdpPercap, y = lifeExp, colour = continent)) +
  geom_point()

gapdata2007 %>% 
  ggplot(aes(x = gdpPercap, y = lifeExp, colour = continent)) +
  geom_point(shape = 1)

4.2 continued…

gapdata2007 %>% 
  ggplot(aes(x = gdpPercap, y = lifeExp, colour = continent)) +
  geom_point(shape = 1) +
  facet_wrap(~continent)

gapdata2007 %>% 
  ggplot(aes(x = gdpPercap, y = lifeExp, colour = continent)) +
  geom_point(shape = 1) +
  facet_wrap(~pop > 50000000)

gapdata2007 %>% 
  ggplot(aes(x = gdpPercap/1000, y = lifeExp, colour = continent)) +
  geom_point(shape = 1) +
  facet_wrap(~continent) +
  theme_bw()

gapdata %>% group_by(continent,year) %>% summarize(lifeExp = mean(lifeExp)) %>%  ggplot() +
  aes(x = year, y = lifeExp, color = continent) +
  geom_point()
## `summarise()` has grouped output by 'continent'. You can override using the
## `.groups` argument.

4.3 Set your theme - grey vs white

theme_set(theme_bw())

library(tidyverse)
theme_set(theme_bw())

4.4 Scatter plots/bubble plots

gapdata2007 %>% 
  ggplot(aes(x = gdpPercap/1000, y = lifeExp, size = pop)) +
  geom_point()

gapdata2007 %>% 
  ggplot(aes(x = gdpPercap/1000, y = lifeExp, size = pop)) +
  geom_point(shape = 1, alpha = 0.5)

4.5 Line plots/time series plots

gapdata %>% 
  filter(country == "United Kingdom") %>% 
  ggplot(aes(x = year, y = lifeExp)) +
  geom_line()

gapdata %>% 
  ggplot(aes(x = year, y = lifeExp)) +
  geom_line()

gapdata %>% 
  ggplot(aes(x = year, y = lifeExp, group = country)) +
  geom_line()

4.5.1 Exercise

Look at Figure 4.9 on page https://argoshare.is.ed.ac.uk/healthyr_book/line-plotstime-series-plots.html

gapdata %>% 
  ggplot(aes(x = year, y = lifeExp, group = country, color = continent)) +
  geom_line() + facet_wrap(~continent) + scale_colour_brewer(palette = "Paired")

Try to transform the above graph following these instructions:

  • Colour lines by continents: add colour = continent inside aes();
  • Continents on separate facets: + facet_wrap(~continent);
  • Use a nicer colour scheme: + scale_colour_brewer(palette = “Paired”)

4.6 Bar plots

4.6.1 Summarised data

gapdata2007 %>% 
  filter(country %in% c("United Kingdom", "France", "Germany")) %>% 
  ggplot(aes(x = country, y = lifeExp)) +
  geom_col() 

gapdata2007 %>% 
  ggplot(aes(x = continent, y = lifeExp, fill = country)) +
  geom_col(position="dodge") + theme(legend.position = "none") 

4.6.2 Countable data

gapdata2007 %>% 
  count(continent)
## # A tibble: 5 x 2
##   continent     n
##   <fct>     <int>
## 1 Africa       52
## 2 Americas     25
## 3 Asia         33
## 4 Europe       30
## 5 Oceania       2
gapdata2007 %>% 
  ggplot(aes(x = continent)) +
  geom_bar()

gapdata2007 %>% 
  ggplot(aes(x = continent, colour = country)) +
  geom_bar(fill = NA) +
  theme(legend.position = "none")

4.6.3 colour vs fill

4.6.4 Proportions

gapdata2007 %>% 
  ggplot(aes(x = "Global", fill = continent)) + 
  geom_bar()

4.6.5 Exercise

Look at Figure 4.13 on the page https://argoshare.is.ed.ac.uk/healthyr_book/bar-plots.html and try to recreate it!

gapdata2007 %>% filter(continent == "Europe") %>%
  mutate(country = fct_reorder(country,lifeExp)) %>% 
  ggplot() + 
  aes(x = lifeExp, y = country) + 
  theme_classic() +
  geom_col(colour = "deepskyblue", fill = NA)

4.7 Histograms

gapdata2007 %>% 
  ggplot(aes(x = lifeExp)) +
  geom_histogram(binwidth = 10)

4.8 Box plots

gapdata2007 %>% 
  ggplot(aes(x = continent, y = lifeExp)) +
  geom_boxplot()

4.9 Multiple geoms, multiple aes()

# (1)
gapdata2007 %>% 
  ggplot(aes(x = continent, y = lifeExp)) +
  geom_boxplot() +
  geom_point()

# (2)
gapdata2007 %>% 
  ggplot(aes(x = continent, y = lifeExp)) +
  geom_boxplot() +
  geom_jitter()

4.9 continued…

# (3)
gapdata2007 %>% 
  ggplot(aes(x = continent, y = lifeExp, colour = continent)) +
  geom_boxplot() +
  geom_jitter()

# (4)
gapdata2007 %>% 
  ggplot(aes(x = continent, y = lifeExp)) +
  geom_boxplot() +
  geom_jitter(aes(colour = continent))

4.9.1 Worked example - three geoms together

label_data <- gapdata2007 %>% 
  group_by(continent) %>% 
  filter(lifeExp == max(lifeExp)) %>% 
  select(country, continent, lifeExp)

# since we filtered for lifeExp == max(lifeExp)
# these are the maximum life expectancy countries at each continent:
label_data
## # A tibble: 5 x 3
## # Groups:   continent [5]
##   country   continent lifeExp
##   <fct>     <fct>       <dbl>
## 1 Australia Oceania      81.2
## 2 Canada    Americas     80.7
## 3 Iceland   Europe       81.8
## 4 Japan     Asia         82.6
## 5 Reunion   Africa       76.4
gapdata2007 %>% 
  ggplot(aes(x = continent, y = lifeExp)) +
  # First geom - boxplot
  geom_boxplot() +
  # Second geom - jitter with its own aes(colour = )
  geom_jitter(aes(colour = continent)) +
  # Third geom - label, with its own dataset (label_data) and aes(label = )
  geom_label(data = label_data, aes(label = country))

Try the suggested experiments given on page https://argoshare.is.ed.ac.uk/healthyr_book/multiple-geoms-multiple-aes.html with the R code above. (Copy-paste the above code chunk below to experiment with it!)

4.10 All other types of plots

4.11 Solutions

You will find solutions to Exercises 4.5.1 and 4.6.5 on the page https://argoshare.is.ed.ac.uk/healthyr_book/solutions.html

4.12 Extra: Advanced examples

(this is all extra material, if you are curious and have some time to check it!)

gapdata %>% 
  filter(continent == "Europe") %>% 
  ggplot(aes(y      = fct_reorder(country, lifeExp, .fun=max),
             x      = lifeExp,
             colour = year)) +
  geom_point(shape = 15, size = 2) +
  scale_colour_distiller(palette = "Greens", direction = 1) +
  theme_bw()

another example:

gapdata2007 %>% 
  group_by(continent) %>% 
  mutate(country_number = seq_along(country)) %>% 
  ggplot(aes(x = continent)) +
  geom_bar(aes(colour = continent), fill = NA, show.legend = FALSE) +
  geom_text(aes(y = country_number, label = country), vjust = 1)+
  geom_label(aes(label = continent), y = -1) +
  theme_void()

Great! That was it for Chapter 4.

The next chapter will take us further in fine tuning plots. It can be well considered EXTRA MATERIAL, so if you are curious, just go on and try it, too!

You may also skip Chapter 5 and proceed straight to Chapter 6.

You may come back in Chapter 5 anytime later, to learn the fine-tuning tricks!


Chapter 5: Fine tuning plots

https://argoshare.is.ed.ac.uk/healthyr_book/finetuning.html

5.1 Get the data

library(gapminder)
library(tidyverse)

p0 <- gapminder %>% 
  filter(year == 2007) %>% 
  ggplot(aes(y = lifeExp, x = gdpPercap, colour = continent)) +
  geom_point(alpha = 0.3) +
  theme_bw() +
  geom_smooth(method = "lm", se = FALSE) +
  scale_colour_brewer(palette = "Set1")

p0
## `geom_smooth()` using formula 'y ~ x'

5.2 Scales

5.2.1 Logarithmic

p1 <- p0 + scale_x_log10()

p1
## `geom_smooth()` using formula 'y ~ x'

5.2.2 Expand limits

p2 <- p0 + expand_limits(y = 0)

p2
## `geom_smooth()` using formula 'y ~ x'

p3 <- p0 + expand_limits(y = c(0, 100))

p3
## `geom_smooth()` using formula 'y ~ x'

p4 <- p0 +
  expand_limits(y = c(0, 100)) +
  coord_cartesian(expand = FALSE)

p4
## `geom_smooth()` using formula 'y ~ x'

# you may install the package also by selecting the command after '#' and activating it:
library(patchwork) # install.packages("patchwork")
p1 + p2 + p3 + p4 + plot_annotation(tag_levels = "1", tag_prefix = "p")
## `geom_smooth()` using formula 'y ~ x'
## `geom_smooth()` using formula 'y ~ x'
## `geom_smooth()` using formula 'y ~ x'
## `geom_smooth()` using formula 'y ~ x'

5.2.3 Zoom in

p5 <- p0 +
  coord_cartesian(ylim = c(70, 85), xlim = c(20000, 40000)) 

p5
## `geom_smooth()` using formula 'y ~ x'

5.2.4 Exercise

p6 <- p0 +
  scale_y_continuous(limits = c(70, 85)) +
  scale_x_continuous(limits = c(20000, 40000))

# Compare:

p5 + labs(tag = "p5") + p6 + labs(tag = "p6")
## `geom_smooth()` using formula 'y ~ x'
## `geom_smooth()` using formula 'y ~ x'
## Warning: Removed 114 rows containing non-finite values (stat_smooth).
## Warning: Removed 114 rows containing missing values (geom_point).

5.2.5 Axis ticks

# calculating the maximum value to be included in the axis breaks:
max_value = gapminder %>% 
  filter(year == 2007) %>%
  summarise(max_lifeExp = max(lifeExp)) %>% 
  pull(max_lifeExp) %>% 
  round(1)

# using scale_y_continuous(breaks = ...):
p7 <-  p0 +
  coord_cartesian(ylim = c(0, 100), expand = 0) +
  scale_y_continuous(breaks = c(18, 50, max_value))

# we may also include custom labels for our breaks:
p8 <-  p0 +
  coord_cartesian(ylim = c(0, 100), expand = 0) +
  scale_y_continuous(breaks = c(18, 50, max_value), labels = c("Adults", "50", "MAX"))

p7 + labs(tag = "p7") + p8 + labs(tag = "p8")
## `geom_smooth()` using formula 'y ~ x'
## `geom_smooth()` using formula 'y ~ x'

5.3 Colours

5.3.1 Using the Brewer palettes:

p9 <- p0 +
  scale_color_brewer(palette = "Paired")
## Scale for 'colour' is already present. Adding another scale for 'colour',
## which will replace the existing scale.

5.3.2 Legend title

p10 <- p0 +
  scale_color_brewer("Continent - \n one of 5", palette = "Paired")
## Scale for 'colour' is already present. Adding another scale for 'colour',
## which will replace the existing scale.
p9 + labs(tag = "p9") + p10 + labs(tag = "p10")
## `geom_smooth()` using formula 'y ~ x'
## `geom_smooth()` using formula 'y ~ x'

5.3.3 Choosing colours manually

p11 <- p0 +
  scale_color_manual(values = c("red", "green", "blue", "purple", "pink"))
## Scale for 'colour' is already present. Adding another scale for 'colour',
## which will replace the existing scale.
p12 <- p0 +
  scale_color_manual(values = c("#8dd3c7", "#ffffb3", "#bebada",
                                "#fb8072", "#80b1d3"))
## Scale for 'colour' is already present. Adding another scale for 'colour',
## which will replace the existing scale.
p11 + labs(tag = "p11") + p12 + labs(tag = "p12")
## `geom_smooth()` using formula 'y ~ x'
## `geom_smooth()` using formula 'y ~ x'

5.4 Titles and labels

p13 <- p0 +
  labs(x = "Gross domestic product per capita",
       y = "Life expectancy",
       title = "Health and economics",
       subtitle = "Gapminder dataset, 2007",
       caption = Sys.Date(),
       tag = "p13")

p13
## `geom_smooth()` using formula 'y ~ x'

5.4.1 Annotation

p14 <- p0 +
  annotate("text",
           x = 25000,
           y = 50,
           label = "No points here!")

p15 <- p0 +
  annotate("label",
           x = 25000,
           y = 50,
           label = "No points here!")

p16 <- p0 +
  annotate("label",
           x = 25000, 
           y = 50,
           label = "No points here!", 
           hjust = 0)

p14 + labs(tag = "p14") + (p15 + labs(tag = "p15"))/ (p16 + labs(tag = "p16"))
## `geom_smooth()` using formula 'y ~ x'
## `geom_smooth()` using formula 'y ~ x'
## `geom_smooth()` using formula 'y ~ x'

5.4.2 Annotation with a superscript and a variable

# a value we made up for this example
# a real analysis would get it from the linear model object
fit_glance <- tibble(r.squared = 0.7693465)


plot_rsquared <- paste0(
  "R^2 == ",
  fit_glance$r.squared %>% round(2))


p17 <- p0 +
  annotate("text",
           x = 25000, 
           y = 50,
           label = plot_rsquared, parse = TRUE,
           hjust = 0)

p17 + labs(tag = "p17")
## `geom_smooth()` using formula 'y ~ x'

5.5 Overall look - theme()

5.5.1 Text size

p18 <-  p0 +
  theme(axis.text.y = element_text(colour = "green", size = 14),
        axis.text.x = element_text(colour = "red",  angle = 45, vjust = 0.5),
        axis.title  = element_text(colour = "blue", size = 16)
        )

p18 + labs(tag = "p18")
## `geom_smooth()` using formula 'y ~ x'

5.5.2 Legend position

p19 <- p0 +
  theme(legend.position = "none")

p20 <- p0 +
  theme(legend.position      = c(1,0), #bottom-right corner
        legend.justification = c(1,0)) 

p19 + labs(tag = "p19") + p20 + labs(tag = "p20")
## `geom_smooth()` using formula 'y ~ x'
## `geom_smooth()` using formula 'y ~ x'

5.5.2 continued…

p21 <- p0 +
  guides(colour = guide_legend(ncol = 2)) +
  theme(legend.position = "top") # moving to the top optional

p21 + labs(tag = "p21")
## `geom_smooth()` using formula 'y ~ x'

5.6 Saving your plot

ggsave(p0, file = "my_saved_plot.pdf", width = 5, height = 4)
## `geom_smooth()` using formula 'y ~ x'
ggsave(p0, file = "my_saved_plot_larger.pdf", width = 10, height = 8)
## `geom_smooth()` using formula 'y ~ x'

Congrats - that was all for Chapter 5!

Chapters 6 and 8 prepare for the corresponding analysis chapters (7 and 9) in the RHDS book.

Both chapters (6 and 8) include useful things, for example, for plotting and data wrangling with continuous (Ch.6) and categorical (Ch.8) variables. Hence you should practice them through, too.


Chapter 6: Working with continuous outcome variables

https://argoshare.is.ed.ac.uk/healthyr_book/chap06-h1.html

6.1 Continuous data

6.2 The Question

6.3 Get and check the data

# Load packages (and install the finalfit package if not yet installed)
library(tidyverse)
library(finalfit) # install.packages("finalfit")
library(gapminder)

# Create object gapdata from object gapminder
gapdata <- gapminder

glimpse(gapdata) # each variable as line, variable type, first values
## Rows: 1,704
## Columns: 6
## $ country   <fct> "Afghanistan", "Afghanistan", "Afghanistan", "Afghanistan", ~
## $ continent <fct> Asia, Asia, Asia, Asia, Asia, Asia, Asia, Asia, Asia, Asia, ~
## $ year      <int> 1952, 1957, 1962, 1967, 1972, 1977, 1982, 1987, 1992, 1997, ~
## $ lifeExp   <dbl> 28.801, 30.332, 31.997, 34.020, 36.088, 38.438, 39.854, 40.8~
## $ pop       <int> 8425333, 9240934, 10267083, 11537966, 13079460, 14880372, 12~
## $ gdpPercap <dbl> 779.4453, 820.8530, 853.1007, 836.1971, 739.9811, 786.1134, ~
missing_glimpse(gapdata) # missing data for each variable
##               label var_type    n missing_n missing_percent
## country     country    <fct> 1704         0             0.0
## continent continent    <fct> 1704         0             0.0
## year           year    <int> 1704         0             0.0
## lifeExp     lifeExp    <dbl> 1704         0             0.0
## pop             pop    <int> 1704         0             0.0
## gdpPercap gdpPercap    <dbl> 1704         0             0.0
ff_glimpse(gapdata) # summary statistics for each variable
## $Continuous
##               label var_type    n missing_n missing_percent       mean
## year           year    <int> 1704         0             0.0     1979.5
## lifeExp     lifeExp    <dbl> 1704         0             0.0       59.5
## pop             pop    <int> 1704         0             0.0 29601212.3
## gdpPercap gdpPercap    <dbl> 1704         0             0.0     7215.3
##                    sd     min quartile_25    median quartile_75          max
## year             17.3  1952.0      1965.8    1979.5      1993.2       2007.0
## lifeExp          12.9    23.6        48.2      60.7        70.8         82.6
## pop       106157896.7 60011.0   2793664.0 7023595.5  19585221.8 1318683096.0
## gdpPercap      9857.5   241.2      1202.1    3531.8      9325.5     113523.1
## 
## $Categorical
##               label var_type    n missing_n missing_percent levels_n
## country     country    <fct> 1704         0             0.0      142
## continent continent    <fct> 1704         0             0.0        5
##                                                      levels
## country                                                   -
## continent "Africa", "Americas", "Asia", "Europe", "Oceania"
##                     levels_count               levels_percent
## country                        -                            -
## continent 624, 300, 396, 360, 24 36.6, 17.6, 23.2, 21.1,  1.4

6.4 Plot the data

6.4.1 Histogram

gapdata %>% 
  filter(year %in% c(2002, 2007)) %>%
  ggplot(aes(x = lifeExp)) +       # remember aes()
  geom_histogram(bins = 20) +      # histogram with 20 bars
  facet_grid(year ~ continent)     # optional: add scales="free"                                 

6.4.2 Quantile-quantile (Q-Q) plot

gapdata %>% 
  filter(year %in% c(2002, 2007)) %>%
  ggplot(aes(sample = lifeExp)) +      # Q-Q plot requires 'sample'
  geom_qq() +                          # defaults to normal distribution
  geom_qq_line(colour = "blue") +      # add the theoretical line
  facet_grid(year ~ continent)

6.4.3 Boxplot

gapdata %>% 
  filter(year %in% c(2002, 2007)) %>%
  ggplot(aes(x = continent, y = lifeExp)) +
  geom_boxplot() +
  facet_wrap(~ year)

6.4.3 continued…

gapdata %>%  
  filter(year %in% c(2002, 2007)) %>%
  ggplot(aes(x = factor(year), y = lifeExp)) +
  geom_boxplot(aes(fill = continent)) +     # add colour to boxplots
  geom_jitter(alpha = 0.4) +                # alpha = transparency
  facet_wrap(~ continent, ncol = 5) +       # spread by continent
  theme(legend.position = "none") +         # remove legend
  xlab("Year") +                            # label x-axis
  ylab("Life expectancy (years)") +         # label y-axis
  ggtitle(
    "Life expectancy by continent in 2002 v 2007") # add title

6.5 Compare the means of two groups

6.5.1 t-test

6.5.2 Two-sample t-tests

ttest_data <- gapdata %>%                    # save as object ttest_data
  filter(year == 2007) %>%                   # 2007 only
  filter(continent %in% c("Asia", "Europe")) # Asia/Europe only

ttest_result <- ttest_data %>%               # example using pipe
  t.test(lifeExp ~ continent, data = .)      # note data = ., see below
ttest_result
## 
##  Welch Two Sample t-test
## 
## data:  lifeExp by continent
## t = -4.6468, df = 41.529, p-value = 3.389e-05
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -9.926525 -3.913705
## sample estimates:
##   mean in group Asia mean in group Europe 
##             70.72848             77.64860
ttest_result$p.value # Extracted element of result object
## [1] 3.38922e-05
ttest_result$conf.int # Extracted element of result object
## [1] -9.926525 -3.913705
## attr(,"conf.level")
## [1] 0.95

6.5.3 Paired t-tests

paired_data <- gapdata %>%             # save as object paired_data
  filter(year %in% c(2002, 2007)) %>%  # 2002 and 2007 only
  filter(continent == "Asia")          # Asia only

paired_data %>%      
  ggplot(aes(x = year, y = lifeExp, 
             group = country)) +       # for individual country lines
  geom_line()

6.5.3 continued…

paired_table <- paired_data %>%        # save object paired_data
  select(country, year, lifeExp) %>%   # select vars interest
  pivot_wider(names_from = year,       # put years in columns
              values_from = lifeExp) %>% 
  mutate(
    dlifeExp = `2007` - `2002`         # difference in means
  )
paired_table
## # A tibble: 33 x 4
##    country          `2002` `2007` dlifeExp
##    <fct>             <dbl>  <dbl>    <dbl>
##  1 Afghanistan        42.1   43.8    1.70 
##  2 Bahrain            74.8   75.6    0.840
##  3 Bangladesh         62.0   64.1    2.05 
##  4 Cambodia           56.8   59.7    2.97 
##  5 China              72.0   73.0    0.933
##  6 Hong Kong, China   81.5   82.2    0.713
##  7 India              62.9   64.7    1.82 
##  8 Indonesia          68.6   70.6    2.06 
##  9 Iran               69.5   71.0    1.51 
## 10 Iraq               57.0   59.5    2.50 
## # ... with 23 more rows
# Mean of difference in years
paired_table %>% summarise( mean(dlifeExp) )
## # A tibble: 1 x 1
##   `mean(dlifeExp)`
##              <dbl>
## 1             1.49
paired_data %>% 
  t.test(lifeExp ~ year, data = ., paired = TRUE)
## 
##  Paired t-test
## 
## data:  lifeExp by year
## t = -14.338, df = 32, p-value = 1.758e-15
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -1.706941 -1.282271
## sample estimates:
## mean of the differences 
##               -1.494606

6.5.4 What if I run the wrong test?

6.6 Compare the mean of one group: one sample t-tests

# the tidy() function comes from the package broom - install it first!
library(broom) # install.packages("broom")
## Warning: package 'broom' was built under R version 4.0.5
gapdata %>% 
  filter(year == 2007) %>%          # 2007 only
  group_by(continent) %>%           # split by continent
  do(                               # dplyr function
    t.test(.$lifeExp, mu = 77) %>%  # compare mean to 77 years 
      tidy()                        # tidy into tibble
  )
## # A tibble: 5 x 9
## # Groups:   continent [5]
##   continent estimate statistic  p.value parameter conf.low conf.high method     
##   <fct>        <dbl>     <dbl>    <dbl>     <dbl>    <dbl>     <dbl> <chr>      
## 1 Africa        54.8    -16.6  3.15e-22        51     52.1      57.5 One Sample~
## 2 Americas      73.6     -3.82 8.32e- 4        24     71.8      75.4 One Sample~
## 3 Asia          70.7     -4.52 7.88e- 5        32     67.9      73.6 One Sample~
## 4 Europe        77.6      1.19 2.43e- 1        29     76.5      78.8 One Sample~
## 5 Oceania       80.7      7.22 8.77e- 2         1     74.2      87.3 One Sample~
## # ... with 1 more variable: alternative <chr>

6.6.1 Interchangeability of t-tests

# note that we're using dlifeExp
# so the differences we calculated above
t.test(paired_table$dlifeExp, mu = 0)
## 
##  One Sample t-test
## 
## data:  paired_table$dlifeExp
## t = 14.338, df = 32, p-value = 1.758e-15
## alternative hypothesis: true mean is not equal to 0
## 95 percent confidence interval:
##  1.282271 1.706941
## sample estimates:
## mean of x 
##  1.494606

6.7 Compare the means of more than two groups

6.7.1 Plot the data

gapdata %>% 
  filter(year == 2007) %>% 
  filter(continent %in% 
           c("Americas", "Europe", "Asia")) %>% 
  ggplot(aes(x = continent, y=lifeExp)) +
  geom_boxplot()

6.7.2 ANOVA

aov_data <- gapdata %>% 
  filter(year == 2007) %>% 
  filter(continent %in% c("Americas", "Europe", "Asia"))

fit = aov(lifeExp ~ continent, data = aov_data) 
summary(fit)
##             Df Sum Sq Mean Sq F value   Pr(>F)    
## continent    2  755.6   377.8   11.63 3.42e-05 ***
## Residuals   85 2760.3    32.5                     
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

6.7.2 continued…

library(broom) # install.packages("broom")
gapdata %>% 
  filter(year == 2007) %>% 
  filter(continent %in% c("Americas", "Europe", "Asia")) %>% 
  aov(lifeExp~continent, data = .) %>% 
  tidy()
## # A tibble: 2 x 6
##   term         df sumsq meansq statistic    p.value
##   <chr>     <dbl> <dbl>  <dbl>     <dbl>      <dbl>
## 1 continent     2  756.  378.       11.6  0.0000342
## 2 Residuals    85 2760.   32.5      NA   NA

6.7.3 Assumptions

library(ggfortify) # install.packages("ggfortify")
## Warning: package 'ggfortify' was built under R version 4.0.5
autoplot(fit)

6.8 Multiple testing

6.8.1 Pairwise testing and multiple comparisons

pairwise.t.test(aov_data$lifeExp, aov_data$continent, 
                p.adjust.method = "bonferroni")
## 
##  Pairwise comparisons using t tests with pooled SD 
## 
## data:  aov_data$lifeExp and aov_data$continent 
## 
##        Americas Asia   
## Asia   0.180    -      
## Europe 0.031    1.9e-05
## 
## P value adjustment method: bonferroni
pairwise.t.test(aov_data$lifeExp, aov_data$continent, 
                p.adjust.method = "fdr")
## 
##  Pairwise comparisons using t tests with pooled SD 
## 
## data:  aov_data$lifeExp and aov_data$continent 
## 
##        Americas Asia   
## Asia   0.060    -      
## Europe 0.016    1.9e-05
## 
## P value adjustment method: fdr

6.9 Non-parametric tests

6.9.1 Transforming data

africa2002 <- gapdata %>%              # save as africa2002
  filter(year == 2002) %>%             # only 2002
  filter(continent == "Africa") %>%    # only Africa
  select(country, lifeExp) %>%         # only these variables
  mutate(
    lifeExp_log = log(lifeExp)         # log life expectancy
  )
head(africa2002)                       # inspect
## # A tibble: 6 x 3
##   country      lifeExp lifeExp_log
##   <fct>          <dbl>       <dbl>
## 1 Algeria         71.0        4.26
## 2 Angola          41.0        3.71
## 3 Benin           54.4        4.00
## 4 Botswana        46.6        3.84
## 5 Burkina Faso    50.6        3.92
## 6 Burundi         47.4        3.86
africa2002 %>% 
  # pivot lifeExp and lifeExp_log values to same column (for easy plotting):
  pivot_longer(contains("lifeExp")) %>% 
  ggplot(aes(x = value)) +             
  geom_histogram(bins = 15) +          # make histogram
  facet_wrap(~name, scales = "free")   # facet with axes free to vary

6.9.2 Non-parametric test for comparing two groups

africa_data <- gapdata %>%                          
  filter(year %in% c(1982, 2007)) %>%      # only 1982 and 2007
  filter(continent %in% c("Africa"))       # only Africa

p1 <- africa_data %>%                      # save plot as p1
  ggplot(aes(x = lifeExp)) + 
  geom_histogram(bins = 15) +
  facet_wrap(~year)

p2 <- africa_data %>%                      # save plot as p2
  ggplot(aes(sample = lifeExp)) +          # `sample` for Q-Q plot
  geom_qq() + 
  geom_qq_line(colour = "blue") + 
  facet_wrap(~year)

p3 <- africa_data %>%                      # save plot as p3
  ggplot(aes(x = factor(year),             # try without factor(year) to
             y = lifeExp)) +               # see the difference
  geom_boxplot(aes(fill = factor(year))) + # colour boxplot
  geom_jitter(alpha = 0.4) +               # add data points
  theme(legend.position = "none")          # remove legend

library(patchwork) # install.packages("patchwork") # great for combining plots
p1 / p2 | p3

africa_data %>% 
  wilcox.test(lifeExp ~ year, data = .)
## 
##  Wilcoxon rank sum test with continuity correction
## 
## data:  lifeExp by year
## W = 1130, p-value = 0.1499
## alternative hypothesis: true location shift is not equal to 0

6.9.3 Non-parametric test for comparing more than two groups

library(broom)
gapdata %>% 
  filter(year == 2007) %>% 
  filter(continent %in% c("Americas", "Europe", "Asia")) %>% 
  kruskal.test(lifeExp~continent, data = .) %>% 
  tidy()
## # A tibble: 1 x 4
##   statistic   p.value parameter method                      
##       <dbl>     <dbl>     <int> <chr>                       
## 1      21.6 0.0000202         2 Kruskal-Wallis rank sum test

6.10 Finalfit approach

dependent <- "year"
explanatory <- c("lifeExp", "pop", "gdpPercap")
africa_data %>%         
  mutate(
    year = factor(year)
  ) %>% 
  summary_factorlist(dependent, explanatory,
                     cont = "median", p = TRUE)
##      label       levels                               1982
##    lifeExp Median (IQR)                50.8 (45.6 to 56.6)
##        pop Median (IQR) 5668228.5 (1569553.8 to 9788207.8)
##  gdpPercap Median (IQR)           1323.7 (828.7 to 2787.6)
##                                  2007     p
##                   52.9 (47.8 to 59.4) 0.149
##  10093310.5 (2909226.5 to 19363654.5) 0.033
##              1452.3 (863.0 to 3993.5) 0.503

6.10 continued…

dependent <- "year"
explanatory <- c("lifeExp", "pop", "gdpPercap")
africa_data %>%         
  mutate(
    year = factor(year)
  ) %>% 
  summary_factorlist(dependent, explanatory,
                     cont_nonpara =  c(1, 3),         # variable 1&3 are non-parametric
                     cont_range = TRUE,               # lower and upper quartile
                     p = TRUE,                        # include hypothesis test
                     p_cont_para = "t.test",          # use t.test/aov for parametric
                     add_row_totals = TRUE,           # row totals
                     include_row_missing_col = FALSE, # missing values row totals
                     add_dependent_label = TRUE)      # dependent label 
##  Dependent: year     Total N                                  1982
##          lifeExp 104 (100.0) Median (IQR)      50.8 (45.6 to 56.6)
##              pop 104 (100.0)    Mean (SD)   9602857.4 (13456243.4)
##        gdpPercap 104 (100.0) Median (IQR) 1323.7 (828.7 to 2787.6)
##                      2007     p
##       52.9 (47.8 to 59.4) 0.149
##   17875763.3 (24917726.2) 0.038
##  1452.3 (863.0 to 3993.5) 0.503

6.11 Conclusions

6.12 Exercises

See page https://argoshare.is.ed.ac.uk/healthyr_book/exercises-1.html

You may try yourself, and then check the solutions (link below)!

6.12.1 Exercise

6.12.2 Exercise

6.12.3 Exercise

6.12.4 Exercise

6.13 Solutions

Solutions to the above exercises!

https://argoshare.is.ed.ac.uk/healthyr_book/solutions-1.html

Great job! Chapter 6 DONE. Next: Chapter 8!


Chapter 8: Working with categorical outcome variables

https://argoshare.is.ed.ac.uk/healthyr_book/chap08-h1.html

8.1 Factors

8.2 The Question

8.3 Get the data

# Get the data from the boot package (that includes tools for bootstrapping methods):
meldata <- boot::melanoma # Survival from Malignant Melanoma

8.4 Check the data

library(tidyverse)
library(finalfit)
theme_set(theme_bw())
meldata %>% glimpse()
## Rows: 205
## Columns: 7
## $ time      <dbl> 10, 30, 35, 99, 185, 204, 210, 232, 232, 279, 295, 355, 386,~
## $ status    <dbl> 3, 3, 2, 3, 1, 1, 1, 3, 1, 1, 1, 3, 1, 1, 1, 3, 1, 1, 1, 1, ~
## $ sex       <dbl> 1, 1, 1, 0, 1, 1, 1, 0, 1, 0, 0, 0, 0, 1, 0, 1, 1, 1, 1, 1, ~
## $ age       <dbl> 76, 56, 41, 71, 52, 28, 77, 60, 49, 68, 53, 64, 68, 63, 14, ~
## $ year      <dbl> 1972, 1968, 1977, 1968, 1965, 1971, 1972, 1974, 1968, 1971, ~
## $ thickness <dbl> 6.76, 0.65, 1.34, 2.90, 12.08, 4.84, 5.16, 3.22, 12.88, 7.41~
## $ ulcer     <dbl> 1, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ~
meldata %>% ff_glimpse()
## $Continuous
##               label var_type   n missing_n missing_percent   mean     sd    min
## time           time    <dbl> 205         0             0.0 2152.8 1122.1   10.0
## status       status    <dbl> 205         0             0.0    1.8    0.6    1.0
## sex             sex    <dbl> 205         0             0.0    0.4    0.5    0.0
## age             age    <dbl> 205         0             0.0   52.5   16.7    4.0
## year           year    <dbl> 205         0             0.0 1969.9    2.6 1962.0
## thickness thickness    <dbl> 205         0             0.0    2.9    3.0    0.1
## ulcer         ulcer    <dbl> 205         0             0.0    0.4    0.5    0.0
##           quartile_25 median quartile_75    max
## time           1525.0 2005.0      3042.0 5565.0
## status            1.0    2.0         2.0    3.0
## sex               0.0    0.0         1.0    1.0
## age              42.0   54.0        65.0   95.0
## year           1968.0 1970.0      1972.0 1977.0
## thickness         1.0    1.9         3.6   17.4
## ulcer             0.0    0.0         1.0    1.0
## 
## $Categorical
## data frame with 0 columns and 205 rows

8.5 Recode the data

meldata <- meldata %>% 
  mutate(sex.factor =             # Make new variable  
           sex %>%                # from existing variable
           factor() %>%           # convert to factor
           fct_recode(            # forcats function
             "Female" = "0",      # new on left, old on right
             "Male"   = "1") %>% 
           ff_label("Sex"),       # Optional label for finalfit
         
         # same thing but more condensed code:
         ulcer.factor = factor(ulcer) %>% 
           fct_recode("Present" = "1",
                      "Absent"  = "0") %>% 
           ff_label("Ulcerated tumour"),
         
         status.factor = factor(status) %>% 
           fct_recode("Died melanoma"       = "1",
                      "Alive"               = "2",
                      "Died - other causes" = "3") %>% 
           ff_label("Status"))

#View(meldata) # take a look at the data!

8.6 Should I convert a continuous variable to a categorical variable?

# Summary of age
meldata$age %>% 
  summary()
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    4.00   42.00   54.00   52.46   65.00   95.00
meldata %>% 
  ggplot(aes(x = age)) + 
  geom_histogram()
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

8.6.1 Equal intervals vs quantiles

meldata <- meldata %>% 
  mutate(
    age.factor = 
      age %>%
      cut(4)
  )

meldata$age.factor %>%
  summary()
## (3.91,26.8] (26.8,49.5] (49.5,72.2] (72.2,95.1] 
##          16          68         102          19

8.6.1 continued…

meldata <- meldata %>% 
  mutate(
    age.factor = 
      age %>%
      Hmisc::cut2(g=4) # Note, cut2 comes from the Hmisc package
  )

meldata$age.factor %>% 
  summary()
## [ 4,43) [43,55) [55,66) [66,95] 
##      55      49      53      48
#View(meldata) # take a look at the data!

8.6.1 continued…

meldata <- meldata %>% 
  mutate(
    age.factor = 
      age %>%
      cut(breaks = c(4,20,40,60,95), include.lowest = TRUE) %>% 
      fct_recode(
        "≤20"      =  "[4,20]",
        "21 to 40" = "(20,40]",
        "41 to 60" = "(40,60]",
        ">60"      = "(60,95]"
      ) %>% 
      ff_label("Age (years)")
  )
head(meldata$age.factor)
## [1] >60      41 to 60 41 to 60 >60      41 to 60 21 to 40
## Levels: =20 21 to 40 41 to 60 >60
#View(meldata) # take a look at the data!

8.7 Plot the data

p1 <- meldata %>% 
  ggplot(aes(x = ulcer.factor, fill = status.factor)) + 
  geom_bar() + 
  theme(legend.position = "none")

p2 <- meldata %>% 
  ggplot(aes(x = ulcer.factor, fill = status.factor)) + 
  geom_bar(position = "fill") + 
  ylab("proportion")

library(patchwork)
p1 + p2

8.7 continued…

p1 <- meldata %>% 
  ggplot(aes(x = ulcer.factor, fill = status.factor)) + 
  geom_bar(position = position_stack(reverse = TRUE)) + 
  theme(legend.position = "none")

p2 <- meldata %>% 
  ggplot(aes(x = ulcer.factor, fill = status.factor)) + 
  geom_bar(position = position_fill(reverse = TRUE)) + 
  ylab("proportion")

library(patchwork)
p1 + p2

8.7 continued…

p1 <- meldata %>% 
  ggplot(aes(x = ulcer.factor, fill=status.factor)) + 
  geom_bar(position = position_stack(reverse = TRUE)) +
  facet_grid(sex.factor ~ age.factor) + 
  theme(legend.position = "none")

p2 <- meldata %>% 
  ggplot(aes(x = ulcer.factor, fill=status.factor)) + 
  geom_bar(position = position_fill(reverse = TRUE)) +
  facet_grid(sex.factor ~ age.factor)+ 
  theme(legend.position = "bottom") +
  ylab("proportion") # this line was missing in the book

p1 / p2

8.8 Group factor levels together - fct_collapse()

meldata <- meldata %>%
  mutate(
    status_dss = fct_collapse( # dss - disease specific survival
      status.factor,
      "Alive" = c("Alive", "Died - other causes"))
  )

#View(meldata) # take a look at the data!

8.9 Change the order of values within a factor - fct_relevel()

# dss - disease specific survival
meldata$status_dss %>% levels()
## [1] "Died melanoma" "Alive"
meldata <- meldata %>% 
  mutate(status_dss = status_dss %>%
           fct_relevel("Alive")
         )

meldata$status_dss %>% levels()
## [1] "Alive"         "Died melanoma"

8.10 Summarising factors with finalfit

library(finalfit)
meldata %>% 
  summary_factorlist(dependent   = "status_dss", 
                     explanatory = "ulcer.factor")
##             label  levels     Alive Died melanoma
##  Ulcerated tumour  Absent 99 (66.9)     16 (28.1)
##                   Present 49 (33.1)     41 (71.9)

8.10 continued…

library(finalfit)
meldata %>% 
  summary_factorlist(dependent = "status_dss", 
                     explanatory = 
                       c("ulcer.factor", "age.factor", 
                         "sex.factor", "thickness")
  )
##             label    levels     Alive Died melanoma
##  Ulcerated tumour    Absent 99 (66.9)     16 (28.1)
##                     Present 49 (33.1)     41 (71.9)
##       Age (years)       =20   6 (4.1)       3 (5.3)
##                    21 to 40 30 (20.3)      7 (12.3)
##                    41 to 60 66 (44.6)     26 (45.6)
##                         >60 46 (31.1)     21 (36.8)
##               Sex    Female 98 (66.2)     28 (49.1)
##                        Male 50 (33.8)     29 (50.9)
##         thickness Mean (SD) 2.4 (2.5)     4.3 (3.6)

8.11 Pearson’s chi-squared and Fisher’s exact tests

8.11.1 Base R

table(meldata$ulcer.factor, meldata$status_dss) 
##          
##           Alive Died melanoma
##   Absent     99            16
##   Present    49            41
# both give same result

with(meldata, table(ulcer.factor, status_dss))
##             status_dss
## ulcer.factor Alive Died melanoma
##      Absent     99            16
##      Present    49            41

8.11.1 continued…

library(magrittr)
## 
## Attaching package: 'magrittr'
## The following object is masked from 'package:purrr':
## 
##     set_names
## The following object is masked from 'package:tidyr':
## 
##     extract
meldata %$%        # note $ sign here
  table(ulcer.factor, status_dss)
##             status_dss
## ulcer.factor Alive Died melanoma
##      Absent     99            16
##      Present    49            41
meldata %$%
  table(ulcer.factor, status_dss) %>% 
  prop.table(margin = 1)     # 1: row, 2: column etc.
##             status_dss
## ulcer.factor     Alive Died melanoma
##      Absent  0.8608696     0.1391304
##      Present 0.5444444     0.4555556
meldata %$%        # note $ sign here
  table(ulcer.factor, status_dss) %>% 
  chisq.test()
## 
##  Pearson's Chi-squared test with Yates' continuity correction
## 
## data:  .
## X-squared = 23.631, df = 1, p-value = 1.167e-06
library(broom)
meldata %$%        # note $ sign here
  table(ulcer.factor, status_dss) %>% 
  chisq.test() %>% 
  tidy()
## # A tibble: 1 x 4
##   statistic    p.value parameter method                                         
##       <dbl>      <dbl>     <int> <chr>                                          
## 1      23.6 0.00000117         1 Pearson's Chi-squared test with Yates' continu~

8.12 Fisher’s exact test

meldata %$%        # note $ sign here
  table(age.factor, status_dss) %>% 
  chisq.test()
## Warning in chisq.test(.): Chi-squared approximation may be incorrect
## 
##  Pearson's Chi-squared test
## 
## data:  .
## X-squared = 2.0198, df = 3, p-value = 0.5683
meldata %$%        # note $ sign here
  table(age.factor, status_dss) %>% 
  fisher.test()
## 
##  Fisher's Exact Test for Count Data
## 
## data:  .
## p-value = 0.5437
## alternative hypothesis: two.sided

8.13 Chi-squared / Fisher’s exact test using finalfit

library(finalfit)
meldata %>% 
  summary_factorlist(dependent   = "status_dss", 
                     explanatory = "ulcer.factor",
                     p = TRUE)
##             label  levels     Alive Died melanoma      p
##  Ulcerated tumour  Absent 99 (66.9)     16 (28.1) <0.001
##                   Present 49 (33.1)     41 (71.9)
meldata %>% 
  summary_factorlist(dependent = "status_dss", 
                     explanatory = 
                       c("ulcer.factor", "age.factor", 
                         "sex.factor", "thickness"),
                     p = TRUE)
## Warning in chisq.test(age.factor, status_dss): Chi-squared approximation may be
## incorrect
##             label    levels     Alive Died melanoma      p
##  Ulcerated tumour    Absent 99 (66.9)     16 (28.1) <0.001
##                     Present 49 (33.1)     41 (71.9)       
##       Age (years)       =20   6 (4.1)       3 (5.3)  0.568
##                    21 to 40 30 (20.3)      7 (12.3)       
##                    41 to 60 66 (44.6)     26 (45.6)       
##                         >60 46 (31.1)     21 (36.8)       
##               Sex    Female 98 (66.2)     28 (49.1)  0.036
##                        Male 50 (33.8)     29 (50.9)       
##         thickness Mean (SD) 2.4 (2.5)     4.3 (3.6) <0.001
meldata %>% 
  summary_factorlist(dependent = "status_dss", 
                     explanatory = 
                       c("ulcer.factor", "age.factor", 
                         "sex.factor", "thickness"),
                     p = TRUE,
                     p_cat = "fisher")
##             label    levels     Alive Died melanoma      p
##  Ulcerated tumour    Absent 99 (66.9)     16 (28.1) <0.001
##                     Present 49 (33.1)     41 (71.9)       
##       Age (years)       =20   6 (4.1)       3 (5.3)  0.544
##                    21 to 40 30 (20.3)      7 (12.3)       
##                    41 to 60 66 (44.6)     26 (45.6)       
##                         >60 46 (31.1)     21 (36.8)       
##               Sex    Female 98 (66.2)     28 (49.1)  0.026
##                        Male 50 (33.8)     29 (50.9)       
##         thickness Mean (SD) 2.4 (2.5)     4.3 (3.6) <0.001

(The code chunk in 8.13, below the text “Further options can be included” does not work…)

8.14 Exercises

8.14.1 Exercise

8.14.2 Exercise

meldata %>%
  count(ulcer.factor, status.factor) %>%
  group_by(status.factor) %>%
  mutate(total = sum(n)) %>%
  mutate(percentage = round(100*n/total, 1)) %>% 
  mutate(count_perc = paste0(n, " (", percentage, ")")) %>% 
  select(-total, -n, -percentage) %>% 
  spread(status.factor, count_perc)
## # A tibble: 2 x 4
##   ulcer.factor `Died melanoma` Alive     `Died - other causes`
##   <fct>        <chr>           <chr>     <chr>                
## 1 Absent       16 (28.1)       92 (68.7) 7 (50)               
## 2 Present      41 (71.9)       42 (31.3) 7 (50)

8.14.3 Exercise

(Solutions to these exercises are not included in the book.)

Gooood job! Chapter 8 DONE.